gaborgsomogyi commented on issue #19096: [SPARK-21869][SS] A cached Kafka 
producer should not be closed if any task is using it - adds inuse tracking.
URL: https://github.com/apache/spark/pull/19096#issuecomment-523351056
 
 
   > If we think it should be longer than 10 minutes, let's increase it
   
   I don't think it's realistic to expect the user to measure the average time 
of a batch query and set the timeout accordingly.
   
   > I don't know where the 10 minutes comes from
   
   There is a config named `spark.kafka.producer.cache.timeout` on master; 
previously it had a different name. As a workaround, users set this 
value to a high number.
   
   > To say that we will never release a resource until a task says it's okay 
is inherently dangerous in a distributed system
   
   True! I'm not saying it's the safest solution, but we already have a 
similar approach on the consumer side. Getting committer attention is super 
hard, and I thought adding ref counting would be less risky from a committer's 
perspective than introducing a completely new lib (like Commons Pool). Thinking 
about alternatives, I see mainly 2 (pretty sure there are others):
   * Apache Commons Pool, which is proposed on the consumer side in 
https://github.com/apache/spark/pull/22138, would be a good alternative here 
as well (my personal preference).
   * We can switch to a solution similar to the one on the consumer side; see 
this extract from the Kafka integration guide:
   ```
   The size of the cache is limited by 
<code>spark.kafka.consumer.cache.capacity</code> (default: 64).
   If this threshold is reached, it tries to remove the least-used entry that 
is currently not in use.
   If it cannot be removed, then the cache will keep growing. In the worst 
case, the cache will grow to
   the max number of concurrent tasks that can run in the executor (that is, 
number of tasks slots),
   after which it will never reduce.
   ```
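
To make the two ideas above concrete (in-use tracking via ref counting, plus a soft capacity limit that only evicts idle entries), here is a minimal sketch. It is not the code from this PR or from Spark's consumer cache: `RefCountedCache`, `Entry`, `acquire` and `release` are hypothetical names, the entry is a stand-in for a real producer or consumer, and least-recently-used ordering is assumed as an approximation of the guide's "least-used".

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

// Hypothetical sketch: a cache that never closes an entry while a task holds it.
final class RefCountedCache {
    // Stand-in for a cached client (for example a Kafka producer).
    static final class Entry {
        final String key;
        int inUse = 0;          // number of tasks currently holding this entry
        boolean closed = false; // set when the cache evicts the entry
        Entry(String key) { this.key = key; }
    }

    private final int capacity; // soft limit, as in the guide extract
    // accessOrder=true makes iteration go from least to most recently used.
    private final LinkedHashMap<String, Entry> entries =
        new LinkedHashMap<>(16, 0.75f, true);

    RefCountedCache(int capacity) { this.capacity = capacity; }

    // Returns the cached entry for the key (creating it if needed) and marks it in use.
    synchronized Entry acquire(String key) {
        Entry e = entries.get(key);
        if (e == null) {
            evictOneIdleIfFull();
            e = new Entry(key);
            entries.put(key, e);
        }
        e.inUse++;
        return e;
    }

    // Must be called once per acquire, typically in a finally block of the task.
    synchronized void release(Entry e) {
        if (e.inUse <= 0) {
            throw new IllegalStateException("release without acquire: " + e.key);
        }
        e.inUse--;
    }

    synchronized int size() { return entries.size(); }

    // Evicts the least recently used idle entry; if every entry is in use,
    // nothing is evicted and the cache grows past the soft capacity.
    private void evictOneIdleIfFull() {
        if (entries.size() < capacity) return;
        Iterator<Entry> it = entries.values().iterator();
        while (it.hasNext()) {
            Entry candidate = it.next();
            if (candidate.inUse == 0) {
                candidate.closed = true; // a real cache would close the client here
                it.remove();
                return;
            }
        }
    }
}
```

In this sketch a task that never calls `release` pins its entry forever, which is exactly the risk raised in the quoted concern; a pooling library or an additional hard timeout would be needed to bound that.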
   

