[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15674254#comment-15674254 ]

Burak Yavuz commented on SPARK-18475:
-------------------------------------

[~ofirm] Thanks for your comment. I've seen significant performance 
improvements, and here's my explanation of how that happens, and why a simple 
repartition won't help:

You're correct that for a given group.id, multiple consumers can't consume 
from the same TopicPartition concurrently. The benefit comes with skewed 
partitions: you don't have one Spark task trying to read everything from Kafka 
while all the other cores sit idle. You achieve better utilization because, as 
each task reads less data, it can move on to the second step of the 
computation more quickly, and while the first CPU has moved on to that second 
step (writing out to some storage system), your second CPU can start reading 
from Kafka. It effectively pipelines your operations. If you use repartition, 
all your cores still wait while that one consumer tries to read everything, 
and then you cause a shuffle on top of it, which is even worse.
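To make the idea concrete, here's a minimal sketch of the kind of offset-range splitting being proposed: one skewed TopicPartition's (fromOffset, untilOffset) range is divided into several contiguous sub-ranges, each of which could back its own Spark task. All names here are illustrative, not Spark's actual internals:

```python
def split_offset_range(from_offset, until_offset, max_splits):
    """Split one Kafka TopicPartition's offset range into at most
    `max_splits` contiguous sub-ranges, so several Spark tasks can
    share a skewed partition. Illustrative sketch only."""
    total = until_offset - from_offset
    if total <= 0 or max_splits <= 1:
        return [(from_offset, until_offset)]
    # Never create more splits than there are offsets to read.
    n = min(max_splits, total)
    step, remainder = divmod(total, n)
    ranges = []
    start = from_offset
    for i in range(n):
        # Spread the remainder across the first `remainder` splits.
        size = step + (1 if i < remainder else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

# A skewed partition with 10 offsets split across 3 tasks:
print(split_offset_range(0, 10, 3))  # [(0, 4), (4, 7), (7, 10)]
```

Each sub-range would need its own (non-cached) consumer seeking to its start offset, which is exactly the CachedKafkaConsumer trade-off mentioned in the issue description.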

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18475
>                 URL: https://issues.apache.org/jira/browse/SPARK-18475
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer" 
> for what it is defined for (being cached) in this use case, but the extra 
> overhead is worth handling data skew and increasing parallelism especially in 
> ETL use cases.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
