boyuanzz commented on pull request #13592: URL: https://github.com/apache/beam/pull/13592#issuecomment-749748837
> > btw I tested it by using Kafka + Dataflow streaming and I don't notice a performance improvement there. It might be that the cost of creating Kafka connection is really cheap.
>
> That doesn't seem to be a surprise, because under the current implementation, it is essential for CheckpointMark to correctly implement equals and hashCode (which KafkaCheckpointMark does not), because between two successive calls to `processElement` the checkpoint is stored in state and therefore serialized and deserialized and so a new object is put into the cache.

I verified manually that the reader is reused from the cache in the Kafka case.

> Second point is that, even after we fix this, it will be probably noticeable only on pipelines with very frequent checkpoints.

It makes me feel like configuring the split frequency from PipelineOptions or the Read API still has value. Comparing how DirectRunner invokes Unbounded Read and Unbounded SDF, a checkpoint happens every 100 elements for Unbounded Read but almost every second for SDF. Besides, SDF uses timers and state to feed the checkpoint back into processing, which adds more processing overhead.
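
For reference, the equals/hashCode point quoted above can be illustrated with a minimal, self-contained sketch. It is not Beam code: `IdentityMark`, `ValueMark`, and `cacheHitAfterRoundTrip` are hypothetical names, a plain `HashMap` stands in for the reader cache, and Java serialization stands in for whatever coder the runner uses to store the checkpoint in state.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

public class CheckpointCacheSketch {

    // Hypothetical checkpoint mark that relies on Object's identity-based equals/hashCode.
    static class IdentityMark implements Serializable {
        final long offset;
        IdentityMark(long offset) { this.offset = offset; }
    }

    // Hypothetical checkpoint mark with value-based equals/hashCode.
    static class ValueMark implements Serializable {
        final long offset;
        ValueMark(long offset) { this.offset = offset; }

        @Override
        public boolean equals(Object other) {
            return other instanceof ValueMark && ((ValueMark) other).offset == offset;
        }

        @Override
        public int hashCode() {
            return Objects.hash(offset);
        }
    }

    // Serializes and deserializes a value, mimicking a checkpoint being written to state and read back.
    static Object roundTrip(Serializable value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                    new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // Caches a placeholder "reader" under the mark, then looks it up with a deserialized copy.
    static boolean cacheHitAfterRoundTrip(Serializable mark) {
        Map<Object, String> cache = new HashMap<>();
        cache.put(mark, "reader");
        return cache.containsKey(roundTrip(mark));
    }

    public static void main(String[] args) {
        System.out.println(cacheHitAfterRoundTrip(new IdentityMark(42L))); // prints false
        System.out.println(cacheHitAfterRoundTrip(new ValueMark(42L)));    // prints true
    }
}
```

With identity-based equality the deserialized copy never matches the cached key, so a lookup keyed this way would miss on every round trip; with value-based equality the lookup succeeds.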
