[
https://issues.apache.org/jira/browse/BEAM-11403?focusedWorklogId=527338&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-527338
]
ASF GitHub Bot logged work on BEAM-11403:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 22/Dec/20 20:01
Start Date: 22/Dec/20 20:01
Worklog Time Spent: 10m
Work Description: boyuanzz commented on pull request #13592:
URL: https://github.com/apache/beam/pull/13592#issuecomment-749748837
> > btw I tested it using Kafka + Dataflow streaming and I didn't notice a
performance improvement there. It might be that the cost of creating a Kafka
connection is really cheap.
>
> That doesn't seem to be a surprise. Under the current
implementation, it is essential for CheckpointMark to correctly implement
equals and hashCode (which KafkaCheckpointMark does not): between two
successive calls to `processElement` the checkpoint is stored in state, and is
therefore serialized and deserialized, so a new object is put into the
cache.
I manually verified that the reader is reused from the cache in the Kafka case.
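To illustrate the equals/hashCode point from the quoted reply, here is a minimal
sketch in plain Java. It does not use any Beam classes: `IdentityMark`,
`ValueMark`, and the `HashMap` standing in for the reader cache are all
hypothetical, and it assumes the cache is keyed by the checkpoint mark object,
as the quoted comment implies. Java serialization stands in for whatever coder
the runner actually uses when it stores the checkpoint in state.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class CheckpointCacheDemo {

  /** Hypothetical mark that inherits identity-based equals/hashCode from Object. */
  public static class IdentityMark implements Serializable {
    final long offset;

    public IdentityMark(long offset) {
      this.offset = offset;
    }
  }

  /** Hypothetical mark that defines value-based equals/hashCode. */
  public static class ValueMark implements Serializable {
    final long offset;

    public ValueMark(long offset) {
      this.offset = offset;
    }

    @Override
    public boolean equals(Object o) {
      return o instanceof ValueMark && ((ValueMark) o).offset == offset;
    }

    @Override
    public int hashCode() {
      return Long.hashCode(offset);
    }
  }

  /** Simulates a trip through state: serialize, then deserialize into a new object. */
  @SuppressWarnings("unchecked")
  public static <T extends Serializable> T roundTrip(T value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(value);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        return (T) in.readObject();
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException(e);
    }
  }

  /** Caches a reader under the mark, then looks it up with the round-tripped copy. */
  public static boolean cacheHitAfterRoundTrip(Serializable mark) {
    Map<Object, String> readerCache = new HashMap<>();
    readerCache.put(mark, "reader");
    return readerCache.containsKey(roundTrip(mark));
  }

  public static void main(String[] args) {
    System.out.println("identity mark hit: " + cacheHitAfterRoundTrip(new IdentityMark(42L)));
    System.out.println("value mark hit: " + cacheHitAfterRoundTrip(new ValueMark(42L)));
  }
}
```

In this sketch the lookup misses for `IdentityMark` because the deserialized copy
is a different object, and hits for `ValueMark` because equality is defined on
the offset.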
> The second point is that, even after we fix this, the improvement will
probably be noticeable only on pipelines with very frequent checkpoints.
This makes me think that configuring the split frequency through a
PipelineOption or the Read API still has value. Comparing how DirectRunner
invokes Unbounded Read and Unbounded SDF, a checkpoint happens every 100
elements for Unbounded Read but almost every second for SDF. In addition, SDF
uses timers and state to feed the checkpoint back into processing, which adds
further processing overhead.
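As a rough illustration of the two checkpoint cadences mentioned above, the
following sketch counts checkpoints for a hypothetical 60-second run at a few
made-up element rates. It is plain arithmetic, not a measurement and not Beam
code: the class and method names are hypothetical, and "almost every second" is
approximated as exactly one second.

```java
public class CheckpointFrequencyDemo {

  /** Checkpoints taken when one checkpoint follows every elementsPerCheckpoint elements. */
  public static long countBased(long totalElements, long elementsPerCheckpoint) {
    return totalElements / elementsPerCheckpoint;
  }

  /** Checkpoints taken when one checkpoint follows every secondsPerCheckpoint seconds. */
  public static long timeBased(long durationSeconds, long secondsPerCheckpoint) {
    return durationSeconds / secondsPerCheckpoint;
  }

  public static void main(String[] args) {
    long durationSeconds = 60; // hypothetical run length
    for (long rate : new long[] {10, 100, 10_000}) { // hypothetical elements per second
      long totalElements = rate * durationSeconds;
      System.out.printf(
          "rate=%d/s count-based=%d time-based=%d%n",
          rate, countBased(totalElements, 100), timeBased(durationSeconds, 1));
    }
  }
}
```

Under these assumptions the time-based cadence checkpoints more often than the
count-based one whenever the source delivers fewer than 100 elements per second,
and less often above that rate.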
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 527338)
Time Spent: 2h (was: 1h 50m)
> Unbounded SDF wrapper causes performance regression on DirectRunner
> -------------------------------------------------------------------
>
> Key: BEAM-11403
> URL: https://issues.apache.org/jira/browse/BEAM-11403
> Project: Beam
> Issue Type: Bug
> Components: runner-direct, sdk-java-core
> Reporter: Boyuan Zhang
> Assignee: Boyuan Zhang
> Priority: P2
> Time Spent: 2h
> Remaining Estimate: 0h
>
> There is a significant performance regression when switching from
> UnboundedSource to Unbounded SDF wrapper. So far there are 2 IOs reported:
> * Pubsub Read:
> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
> * Kafka Read: https://the-asf.slack.com/archives/C9H0YNP3P/p1606155042346600
--
This message was sent by Atlassian Jira
(v8.3.4#803005)