gaborgsomogyi commented on a change in pull request #24590: [SPARK-27687][SS] Rename Kafka consumer cache capacity and document caching
URL: https://github.com/apache/spark/pull/24590#discussion_r283640131
 
 

 ##########
 File path: docs/structured-streaming-kafka-integration.md
 ##########
 @@ -416,6 +416,24 @@ The following configurations are optional:
 </tr>
 </table>
 
+### Consumer Caching
+
+It's time consuming to initialize Kafka consumers, especially in streaming scenarios where processing time is a key factor.
+Because of this, Spark caches Kafka consumers on executors. The caching key is built up from the following information (see the sketch after the list):
+* Topic name
+* Topic partition
+* Group ID
+
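For illustration only, here is a rough conceptual sketch of such a key in Scala. The class and field names are made up for this example and are not Spark's internal API; the point is simply that one cached consumer corresponds to one (topic, partition, group ID) combination on an executor.

```scala
// Conceptual sketch only -- not Spark's internal class.
// A cached Kafka consumer on an executor is identified by topic, partition and group ID.
case class ConsumerCacheKey(topic: String, partition: Int, groupId: String)

// Two tasks on the same executor reading the same topic-partition with the same
// group ID would map to the same cache entry (hypothetical values below).
val key = ConsumerCacheKey("events", 0, "example-group-id")
```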
+The size of the cache is limited by <code>spark.kafka.consumer.cache.capacity</code> (default: 64).
+When this threshold is reached, Spark tries to remove the least-used entry that is currently not in use.
+If no such entry can be removed, the cache keeps growing. In the worst case, the cache grows to
+the maximum number of concurrent tasks that can run in the executor (that is, the number of task slots),
+after which it never shrinks.
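As a rough illustration of how the capacity could be changed: the configuration key is the one documented above, while the application name and the value 128 are only example values. Note that the setting has to be in place before the executors are launched, so it is typically passed at submit time or, as in this sketch, before the session is created.

```scala
import org.apache.spark.sql.SparkSession

// Example only: raise the per-executor Kafka consumer cache capacity from the
// default of 64 to 128. The value must be set before executors start, e.g. via
// --conf spark.kafka.consumer.cache.capacity=128 at submit time or, as below,
// before the session (and its underlying SparkContext) is created.
val spark = SparkSession.builder()
  .appName("kafka-consumer-cache-example")              // example app name
  .config("spark.kafka.consumer.cache.capacity", "128") // documented option, example value
  .getOrCreate()
```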
+
+If a task fails for any reason, the new task is executed with a newly created Kafka consumer for safety reasons.
 
 Review comment:
   Fixed.
