Steve Niemitz created BEAM-7745:
-----------------------------------

             Summary: StreamingSideInputDoFnRunner/StreamingSideInputFetcher 
have suboptimal state access pattern during normal operation
                 Key: BEAM-7745
                 URL: https://issues.apache.org/jira/browse/BEAM-7745
             Project: Beam
          Issue Type: Improvement
          Components: runner-dataflow
            Reporter: Steve Niemitz


I spent some time tracking down sources of uncached state fetches in my job, 
and one large category was the interaction of StreamingSideInputDoFnRunner + 
StreamingSideInputFetcher.

During normal operation, when the main input is NOT blocked by the side 
input, the side input fetcher performs an uncached state read for every input 
element.  Changing it to cache the blockedMap state gave me a ~30-40% increase 
in throughput in my job.

The interaction is a little complicated, and there are a couple of 
optimizations here that I can see.

 

The main one: the blockedMap is only persisted if it is non-empty.  Because 
the WindmillStateCache won't cache a null value, the "nothing is blocked" 
signal is never actually cached, so the fetcher issues a state read to 
Windmill for each input element.  The fix seems to be to persist an empty map 
rather than a null when there are no blocked elements.
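
To make the effect concrete, here is a toy model of the access pattern. It is 
only a sketch and is not Beam or Dataflow code: FakeStore, NullSkippingCache 
and readsFor are hypothetical names made up for illustration, and the only 
behavior taken from the description above is that the cache does not keep 
null values.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these classes exist in Beam.
class BlockedMapCacheSketch {

  /** Hypothetical stand-in for the backing state store; counts reads. */
  static final class FakeStore {
    private final Map<String, Map<String, Integer>> persisted = new HashMap<>();
    int reads = 0;

    Map<String, Integer> read(String key) {
      reads++;
      return persisted.get(key); // null when nothing was ever persisted
    }

    void write(String key, Map<String, Integer> value) {
      persisted.put(key, value);
    }
  }

  /** Hypothetical read-through cache that, as described above, skips nulls. */
  static final class NullSkippingCache {
    private final Map<String, Map<String, Integer>> cached = new HashMap<>();
    private final FakeStore store;

    NullSkippingCache(FakeStore store) {
      this.store = store;
    }

    Map<String, Integer> get(String key) {
      Map<String, Integer> hit = cached.get(key);
      if (hit != null) {
        return hit; // served from cache, no store read
      }
      Map<String, Integer> value = store.read(key);
      if (value != null) {
        cached.put(key, value); // an empty map is non-null, so it is cached
      }
      return value;
    }
  }

  /** Counts store reads while processing `elements` unblocked inputs. */
  static int readsFor(boolean persistEmptyMap, int elements) {
    FakeStore store = new FakeStore();
    if (persistEmptyMap) {
      store.write("blockedMap", new HashMap<>()); // proposed: persist empty map
    }
    NullSkippingCache cache = new NullSkippingCache(store);
    for (int i = 0; i < elements; i++) {
      Map<String, Integer> blocked = cache.get("blockedMap");
      boolean nothingBlocked = blocked == null || blocked.isEmpty();
      if (!nothingBlocked) {
        throw new IllegalStateException("toy model never blocks");
      }
    }
    return store.reads;
  }

  public static void main(String[] args) {
    System.out.println("null persisted:      " + readsFor(false, 1000) + " reads");
    System.out.println("empty map persisted: " + readsFor(true, 1000) + " reads");
  }
}
```

In this model, leaving the state null costs one store read per element, while 
persisting an empty map costs a single read that is then served from the cache.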

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
