gaborgsomogyi commented on issue #19096: [SPARK-21869][SS] A cached Kafka producer should not be closed if any task is using it - adds inuse tracking.
URL: https://github.com/apache/spark/pull/19096#issuecomment-523351056

> If we think it should be longer than 10 minutes, let's increase it

I don't think it's realistic to expect users to measure the average time of a batch query and set the timeout accordingly.

> I don't know where the 10 minutes comes from

There is a config named `spark.kafka.producer.cache.timeout` on master; previously it had a different name. As a workaround, users set this value to a high number.

> To say that we will never release a resource until a task says it's okay is inherently dangerous in a distributed system

True! I'm not saying it's the safest solution, but we already have a similar approach on the consumer side. Getting committer attention is super hard, and I thought it would be less risky from a committer's perspective to add ref counting than to introduce a completely new lib (like Commons Pool).

Thinking about alternatives, I see mainly 2 (pretty sure there are others):

* Apache Commons Pool would be a good alternative here as well; it is proposed on the consumer side in https://github.com/apache/spark/pull/22138. (my personal preference)
* We can switch to a solution similar to the one on the consumer side; see this extract from the Kafka integration guide:

```
The size of the cache is limited by <code>spark.kafka.consumer.cache.capacity</code> (default: 64). If this threshold is reached, it tries to remove the least-used entry that is currently not in use. If it cannot be removed, then the cache will keep growing. In the worst case, the cache will grow to the max number of concurrent tasks that can run in the executor (that is, number of tasks slots), after which it will never reduce.
```
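To make the consumer-side behaviour described in the guide extract concrete, here is a minimal language-neutral sketch in Python. It is not Spark's implementation: the class `InUseTrackingCache`, its `acquire`/`release` methods, and the `create`/`close` callbacks are hypothetical names, and "least-used" is approximated here by least-recently-used order, which is an assumption.

```python
import threading
from collections import OrderedDict


class InUseTrackingCache:
    """Hypothetical ref-counted cache: idle entries are evicted only when over capacity."""

    def __init__(self, capacity, create, close):
        self._capacity = capacity
        self._create = create    # assumed factory: key -> resource
        self._close = close      # assumed cleanup callback: resource -> None
        self._lock = threading.Lock()
        # key -> [resource, ref_count]; iteration order is least recently used first
        self._entries = OrderedDict()

    def __len__(self):
        with self._lock:
            return len(self._entries)

    def acquire(self, key):
        """Return the cached resource for key, creating it if needed, and mark it in use."""
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                entry = [self._create(key), 0]
                self._entries[key] = entry
            entry[1] += 1
            self._entries.move_to_end(key)
            self._evict_idle()
            return entry[0]

    def release(self, key):
        """Mark one use of key as finished; the entry stays cached unless over capacity."""
        with self._lock:
            entry = self._entries[key]
            if entry[1] == 0:
                raise ValueError(f"release without matching acquire: {key!r}")
            entry[1] -= 1
            self._evict_idle()

    def _evict_idle(self):
        # Walk from the least recently used entry; skip anything still in use.
        # If every entry is in use, nothing is removed and the cache keeps growing.
        for key in list(self._entries):
            if len(self._entries) <= self._capacity:
                break
            resource, refs = self._entries[key]
            if refs == 0:
                del self._entries[key]
                self._close(resource)


# Usage: capacity 2, three concurrent users, so the cache temporarily grows to 3.
closed = []
cache = InUseTrackingCache(capacity=2, create=lambda k: f"producer-{k}", close=closed.append)
for name in ("a", "b", "c"):
    cache.acquire(name)
size_while_all_in_use = len(cache)   # nothing could be evicted
cache.release("a")                   # "a" is now idle and the cache is over capacity
size_after_release = len(cache)
```

In this sketch a resource is never closed while a caller still holds it, which is exactly the trade-off raised in the review: a task that never calls `release` pins its entry forever, so a real implementation would likely pair this with a timeout or a pooling library.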
