Stephane Maarek created SPARK-20287:
---------------------------------------
Summary: Kafka Consumer should be able to subscribe to more than
one topic partition
Key: SPARK-20287
URL: https://issues.apache.org/jira/browse/SPARK-20287
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 2.1.0
Reporter: Stephane Maarek
As I understand it, Spark currently creates one Kafka consumer for each topic
partition in the source Kafka topics, and caches these consumers.
cf
https://github.com/apache/spark/blob/master/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/CachedKafkaConsumer.scala#L48
In my opinion, that makes the design an anti-pattern for Kafka and highly
inefficient:
- Each Kafka consumer creates a connection to Kafka
- Spark doesn't leverage the main strength of Kafka consumers, which is that
Kafka automatically assigns and balances partitions among all the consumers
that share the same group.id
- You can still cache a Kafka consumer even if it is assigned multiple
partitions.
I'm not sure how that translates to Spark's underlying RDD architecture, but
from a Kafka standpoint, I believe creating one consumer per partition is a
big overhead. It is also a risk, as the user may have to increase the
spark.streaming.kafka.consumer.cache.maxCapacity parameter.
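
To make the overhead concrete, here is a rough back-of-the-envelope sketch in
Python. It is a toy model, not Spark code: the workload (3 topics with 50
partitions each, 2 executors), the round-robin placement, the helper names and
the cache capacity of 64 are all assumptions chosen for illustration. It only
counts how many consumers each executor would hold under the two designs.

```python
from collections import defaultdict


def assign_round_robin(partitions, executors):
    """Spread topic partitions over executors in round-robin order (toy model)."""
    assignment = defaultdict(list)
    for i, tp in enumerate(partitions):
        assignment[executors[i % len(executors)]].append(tp)
    return dict(assignment)


def consumers_per_executor(assignment, one_per_partition):
    """Count consumers per executor: one per partition, or one shared consumer."""
    return {
        ex: (len(tps) if one_per_partition else 1)
        for ex, tps in assignment.items()
    }


# Hypothetical workload: 3 topics x 50 partitions, 2 executors.
partitions = [(f"topic-{t}", p) for t in range(3) for p in range(50)]
assignment = assign_round_robin(partitions, ["exec-1", "exec-2"])

per_partition = consumers_per_executor(assignment, one_per_partition=True)
shared = consumers_per_executor(assignment, one_per_partition=False)

# Assumed value of spark.streaming.kafka.consumer.cache.maxCapacity.
ASSUMED_CACHE_CAPACITY = 64
over_capacity = [
    ex for ex, n in per_partition.items() if n > ASSUMED_CACHE_CAPACITY
]

print(per_partition)   # consumers per executor, one consumer per partition
print(shared)          # consumers per executor, one consumer for all partitions
print(over_capacity)   # executors whose consumer count exceeds the assumed cache size
```

Under these assumptions each executor ends up with 75 consumers in the
one-per-partition design, which is above the assumed cache capacity, versus a
single consumer each if one consumer could be assigned several partitions.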
Happy to discuss this further to understand the rationale.