Daniel Mills created BEAM-11417:
-----------------------------------

             Summary: StreamingDataflowWorker can leak UnboundedSource 
finalization callbacks
                 Key: BEAM-11417
                 URL: https://issues.apache.org/jira/browse/BEAM-11417
             Project: Beam
          Issue Type: Improvement
          Components: runner-dataflow
            Reporter: Daniel Mills


StreamingDataflowWorker keeps a map of finalization callbacks 
(https://github.com/apache/beam/blob/master/runners/google-cloud-dataflow-java/worker/src/main/java/org/apache/beam/runners/dataflow/worker/StreamingDataflowWorker.java#L401).
If the Dataflow service loses a callback ID (for example due to autoscaling; 
callbacks are best-effort), the callback is never removed from the map and 
stays around forever.

This can cause a relatively rapid memory leak for sources like KafkaIO where 
the callback (the KafkaCheckpointMark) has a reference to the 
KafkaUnboundedReader object, which keeps a KafkaConsumer object alive.

A simple fix would be to replace the ConcurrentHashMap with a Guava Cache that 
expires elements after a timeout.
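
To illustrate the idea of expiring stale callbacks, here is a minimal sketch 
using only the JDK. It is not the worker's actual code: the class name 
ExpiringCallbacks, its methods, and the injectable clock are all hypothetical, 
and a hand-rolled timestamp check stands in for the Guava Cache proposed above. 
In practice, Guava's CacheBuilder.newBuilder().expireAfterWrite(...) would 
likely provide the same behavior with less code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Hypothetical sketch: a callback registry whose entries expire after a fixed TTL. */
final class ExpiringCallbacks {
  /** A callback paired with the time after which it is considered stale. */
  private static final class Entry {
    final Runnable callback;
    final long expiresAtMillis;

    Entry(Runnable callback, long expiresAtMillis) {
      this.callback = callback;
      this.expiresAtMillis = expiresAtMillis;
    }
  }

  private final Map<Long, Entry> callbacks = new ConcurrentHashMap<>();
  private final long ttlMillis;
  private final LongSupplier clock; // injectable so expiry can be tested deterministically

  ExpiringCallbacks(long ttlMillis, LongSupplier clock) {
    this.ttlMillis = ttlMillis;
    this.clock = clock;
  }

  /** Registers a callback under the given ID and opportunistically drops stale entries. */
  void register(long id, Runnable callback) {
    evictExpired();
    callbacks.put(id, new Entry(callback, clock.getAsLong() + ttlMillis));
  }

  /** Removes the callback and runs it if it has not expired; returns whether it ran. */
  boolean finalizeCallback(long id) {
    Entry entry = callbacks.remove(id);
    if (entry == null || entry.expiresAtMillis <= clock.getAsLong()) {
      return false;
    }
    entry.callback.run();
    return true;
  }

  /** Drops every entry whose TTL has elapsed, releasing whatever the callback references. */
  void evictExpired() {
    long now = clock.getAsLong();
    callbacks.values().removeIf(entry -> entry.expiresAtMillis <= now);
  }

  int size() {
    return callbacks.size();
  }
}
```

With this shape, a callback whose ID is never reported back by the service is 
dropped once its TTL elapses, so objects it references (such as a reader and 
its consumer) become eligible for garbage collection. The TTL value itself 
would need to be chosen so that callbacks the service still intends to 
finalize are not evicted early.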



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
