[GitHub] [spark] gaborgsomogyi commented on a change in pull request #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer

GitBox Tue, 03 Sep 2019 00:58:10 -0700

gaborgsomogyi commented on a change in pull request #22138: [SPARK-25151][SS] 
Apply Apache Commons Pool to KafkaDataConsumer
URL: https://github.com/apache/spark/pull/22138#discussion_r320127585


 ##########
 File path: docs/structured-streaming-kafka-integration.md
 ##########
 @@ -430,20 +430,70 @@ The following configurations are optional:
 ### Consumer Caching
 
 It's time-consuming to initialize Kafka consumers, especially in streaming 
scenarios where processing time is a key factor.
-Because of this, Spark caches Kafka consumers on executors. The caching key is 
built up from the following information:
+Because of this, Spark pools Kafka consumers on executors, by leveraging 
Apache Commons Pool.
+
+The caching key is built up from the following information:
+
 * Topic name
 * Topic partition
 * Group ID
 
-The size of the cache is limited by 
<code>spark.kafka.consumer.cache.capacity</code> (default: 64).
-If this threshold is reached, it tries to remove the least-used entry that is 
currently not in use.
-If it cannot be removed, then the cache will keep growing. In the worst case, 
the cache will grow to
-the max number of concurrent tasks that can run in the executor (that is, 
number of tasks slots),
-after which it will never reduce.
+The following properties are available to configure the consumer pool:
+
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+  <td>spark.kafka.consumer.cache.capacity</td>
+  <td>The maximum number of consumers cached. Please note that it's a soft 
limit.</td>
+  <td>64</td>
+</tr>
+<tr>
+  <td>spark.kafka.consumer.cache.timeout</td>
+  <td>The minimum amount of time a consumer may sit idle in the pool before it 
is eligible for eviction by the evictor.</td>
+  <td>5m (5 minutes)</td>
+</tr>
+<tr>
+  <td>spark.kafka.consumer.cache.jmx.enable</td>
+  <td>Enable or disable JMX for pools created with this configuration 
instance. Statistics of the pool are available via JMX instance.
+  The prefix of JMX name is set to 
"kafka010-cached-simple-kafka-consumer-pool".
+  </td>
+  <td>false</td>
+</tr>
+</table>
+
+The size of the pool is limited by 
<code>spark.kafka.consumer.cache.capacity</code>,
+but it works as "soft-limit" to not block Spark tasks.
+
+Idle eviction thread periodically removes some consumers which are not used. 
If this threshold is reached when borrowing,
+it tries to remove the least-used entry that is currently not in use.
+
+If it cannot be removed, then the pool will keep growing. In the worst case, 
the pool will grow to
+the max number of concurrent tasks that can run in the executor (that is, 
number of tasks slots).
 
 Review comment:
   Nit: s/tasks slots/task slots

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] gaborgsomogyi commented on a change in pull request #22138: [SPARK-25151][SS] Apply Apache Commons Pool to KafkaDataConsumer

Reply via email to