[jira] [Commented] (SPARK-18475) Be able to provide higher parallelization for StructuredStreaming Kafka Source

Ofir Manor (JIRA) Tue, 22 Nov 2016 15:13:14 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15688246#comment-15688246
 ]


Ofir Manor commented on SPARK-18475:
------------------------------------

Cody, for me you are the main gatekeeper for everything Kafka and the main 
Kafka expert, so I wanted your perspective, not Michael's (except on the 
generic "order" guarantee, which I still think does not exist).
I thought that if someone made the effort of building, testing and trying to 
contribute it, that is an indication that it hurts in the real world, 
especially since you said it is a repeated request. I guess in many places, 
getting read access to a potentially huge, shared topic is not the same as 
having Kafka admin rights, being the only or main consumer, or being able to 
easily fix bad past decisions around partitions and keys...
Anyway, it is totally up to you; you'll have to maintain it. I personally have 
no use for this feature.

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18475
>                 URL: https://issues.apache.org/jira/browse/SPARK-18475
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer" 
> for what it is defined for (being cached) in this use case, but the extra 
> overhead is worth handling data skew and increasing parallelism especially in 
> ETL use cases.
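
To make the idea in the quoted description concrete, here is a minimal sketch 
of how the offset range of one TopicPartition could be divided among several 
tasks. It is not the Spark implementation: `OffsetRange` and `split_range` are 
hypothetical names used only for illustration, and the sketch assumes offsets 
form a half-open range [from_offset, until_offset).

```python
from typing import List, NamedTuple


class OffsetRange(NamedTuple):
    """Hypothetical description of the work assigned to one task."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def split_range(r: OffsetRange, num_splits: int) -> List[OffsetRange]:
    """Split one offset range into at most num_splits contiguous sub-ranges.

    Sizes differ by at most one offset, and no sub-range is empty.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be >= 1")
    size = r.until_offset - r.from_offset
    if size <= 0:
        return []
    n = min(num_splits, size)       # never create more splits than offsets
    base, extra = divmod(size, n)   # the first `extra` splits get one more offset
    out = []
    start = r.from_offset
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        out.append(OffsetRange(r.topic, r.partition, start, end))
        start = end
    return out


if __name__ == "__main__":
    for sub in split_range(OffsetRange("events", 0, 100, 110), 3):
        print(sub)
```

Each sub-range would be read by a separate task, so a single consumer for that 
TopicPartition could no longer be reused across all of its reads, which is the 
caching trade-off the description mentions.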



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
