[jira] [Work logged] (BEAM-11403) Unbounded SDF wrapper causes performance regression on DirectRunner

ASF GitHub Bot (Jira) Tue, 22 Dec 2020 12:22:06 -0800


     [ 
https://issues.apache.org/jira/browse/BEAM-11403?focusedWorklogId=527350&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-527350
 ]


ASF GitHub Bot logged work on BEAM-11403:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 22/Dec/20 20:21
            Start Date: 22/Dec/20 20:21
    Worklog Time Spent: 10m 
      Work Description: boyuanzz commented on pull request #13592:
URL: https://github.com/apache/beam/pull/13592#issuecomment-749757695


   > > I verified that the reader is reused from cache in Kafka case manually.
   > 
   > Hm, are you sure? That confuses me, because looking at the code, I'm not 
sure how that could function with `System.identityHashCode` as hashCode for 
CheckpointMark. Provided that AutoValue delegates its hashCode as would be 
expected and that guava's Cache uses hashCode. Hm, maybe Dataflow is not using 
SplittableDoFnViaKeyedWorkItems and has some specific implementation? 
   
   All changes are in SDK side, as well as the cache creation, So it will be 
independent from runner executions(and I'm using beam_fn_api as well).
   
   But you are right that `hashCode` is not correct if it is not implemented 
correctly.
   
   > > It makes me feel like configuring split frequency from PipelineOption
   > 
   > Sure, Flink has such an option. It would be natural to either create one 
for generic use, or add that to respective runner's PipelineOptions.
    
   The checkpoint for SDF is different from Flink checkpoint. That says, even 
the checkpoint interval is not configured for Flink, the checkpoint for SDF 
will still happen based on how 
`OutputAndTimeBoundedSplittableProcessElementInvoker` is configured.
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 527350)
    Time Spent: 2h 40m  (was: 2.5h)

> Unbounded SDF wrapper causes performance regression on DirectRunner
> -------------------------------------------------------------------
>
>                 Key: BEAM-11403
>                 URL: https://issues.apache.org/jira/browse/BEAM-11403
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-direct, sdk-java-core
>            Reporter: Boyuan Zhang
>            Assignee: Boyuan Zhang
>            Priority: P2
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> There is a significant performance regression when switching from 
> UnboundedSource to Unbounded SDF wrapper. So far there are 2 IOs reported:
> * Pubsub Read: 
> https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E
> * Kafka Read: https://the-asf.slack.com/archives/C9H0YNP3P/p1606155042346600



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Work logged] (BEAM-11403) Unbounded SDF wrapper causes performance regression on DirectRunner

Reply via email to