I opened a Jira issue about this years ago [1], and would like to actually
fix it now, given that our users have started using streaming more and
more.

There's more detail in the Jira issue, but basically side inputs in
streaming pipelines on Dataflow lead to pretty bad performance because they
result in a state lookup request for each element.

I think the best solution would be to stop storing null for the blockedMap,
and instead store an empty map.  This way there will (generally) be a
cache hit when looking it up (the cache can't cache a null).  The only
issue here though is that something then needs to clean up the map.  It
seems like the StreamingSideInputDoFnRunner could probably set its own
cleanup timer here to do that.
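
As a rough illustration of the caching point above, here is a toy model of a
read-through cache that cannot hold a null.  It is only a sketch: the classes
CountingStore and ReadThroughCache and the method readsFor are made up for
this example and are not the Dataflow worker's state cache or any Beam API.
It just counts how many reads reach the backing store when the same key is
looked up once per element.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only; every name here is hypothetical, not part of Beam or Dataflow. */
public class BlockedMapCacheSketch {

  /** Stand-in for the remote state backend; counts the reads it serves. */
  static final class CountingStore {
    final Map<String, Map<String, String>> data = new HashMap<>();
    int reads = 0;

    Map<String, String> read(String key) {
      reads++;
      return data.get(key); // null when nothing was ever written for this key
    }
  }

  /** Stand-in for a read-through cache that, as assumed above, cannot cache a null. */
  static final class ReadThroughCache {
    final Map<String, Map<String, String>> cached = new HashMap<>();
    final CountingStore store;

    ReadThroughCache(CountingStore store) {
      this.store = store;
    }

    Map<String, String> get(String key) {
      Map<String, String> hit = cached.get(key);
      if (hit != null) {
        return hit; // cache hit, no store read
      }
      Map<String, String> loaded = store.read(key);
      if (loaded != null) {
        cached.put(key, loaded); // only non-null values can be remembered
      }
      return loaded;
    }
  }

  /** Looks up the same key once per element and returns the number of store reads. */
  static int readsFor(Map<String, String> storedValue, int elements) {
    CountingStore store = new CountingStore();
    if (storedValue != null) {
      store.data.put("blockedMap", storedValue);
    }
    ReadThroughCache cache = new ReadThroughCache(store);
    for (int i = 0; i < elements; i++) {
      cache.get("blockedMap");
    }
    return store.reads;
  }

  public static void main(String[] args) {
    System.out.println("null stored:      " + readsFor(null, 1000) + " store reads");
    System.out.println("empty map stored: " + readsFor(new HashMap<>(), 1000) + " store reads");
  }
}
```

In this toy model the null case goes to the store for every one of the 1000
elements, while the empty-map case goes to the store once and is served from
the cache afterwards.  The sketch deliberately leaves out the cleanup side,
which is the open question here.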

I'm curious whether anyone has thoughts on better ways to do this.

Also, as a side note, I was looking at the SideInputHandler implementation
in runners-core, and I can't seem to see where the state that it maintains
is ever cleaned up?

[1] https://issues.apache.org/jira/browse/BEAM-7745
