boyuanzz commented on pull request #13592:
URL: https://github.com/apache/beam/pull/13592#issuecomment-749748837


   > > btw I tested it by using Kafka + Dataflow streaming and I don't notice a 
performance improvement there. It might be that the cost of creating Kafka 
connection is really cheap.
   > 
   > That doesn't seem to be a surpise, because under the current 
implementation, it is essential for CheckpointMark to correctly implement 
equals and hashCode (which KafkaCheckpointMark does not), because between two 
successive calls to `processElement` the checkpoint is stored in state and 
therefore serialized and deserialized and so a new object is put into the 
cache. 
   
   I verified that the reader is reused from cache in Kafka case manually. 
   
   > Second point is that, even after we fix this, it will be probably 
noticeable only on pipelines with very frequent checkpoints.
   
   It makes me feel like configuring split frequency from PipelineOption or 
Read API still has value. If comparing the way between DirectRunner invokes 
Unbounded Read and Unbounded SDF, the checkpoint happens every 100 elements for 
Unbounded Read but almost every second for SDF. Besides, SDF uses timers and 
states to feed back checkpoint to process, which brings more overhead for 
processing.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to