[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15706788#comment-15706788 ]

Burak Yavuz commented on SPARK-18475:
-------------------------------------

I'd be happy to share performance results. You're right, I never tried it with 
SSL on. One thing to note is that I was never planning to have this enabled by 
default, because there is no sane default parallelism value to pick.
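
To make that concrete, here's a sketch of what the user-facing side could look 
like; the option name "minPartitions" below is just a placeholder I'm using for 
illustration, not a final API:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("kafka-skew").getOrCreate()

    // Opt-in only: the source behaves exactly as it does today unless the
    // user sets the knob, since no default value could suit every workload.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host1:9092")
      .option("subscribe", "events")
      // Placeholder option name: ask for at least this many Spark tasks
      // per micro-batch, splitting skewed TopicPartitions as needed.
      .option("minPartitions", "64")
      .load()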

What I was hoping to achieve was to give Spark users, who may not be Kafka 
experts, a "break in case of emergency" way out. It's easy to tell people 
"partition your data properly" until someone upstream in your organization 
changes one thing and the data engineer has to deal with the mess of skewed 
data.

You may want to tell people "hey, increase your Kafka partitions" if you want 
to increase Kafka parallelism, but is that a viable operation when your queues 
are already messed up and the damage has already been done? Are you going to 
have them empty the queue, delete the topic, create a topic with an increased 
number of partitions, and re-consume everything so that it is properly 
partitioned again?

It's easy to talk about what needs to be done, and what the proper way to do 
things is, until shit hits the fan in production because of something that 
is or was totally out of your control, and you have to clean up the mess.

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18475
>                 URL: https://issues.apache.org/jira/browse/SPARK-18475
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> What this means is that we won't be able to use the "CachedKafkaConsumer" 
> for its intended purpose (being cached) in this use case, but the extra 
> overhead is worth it to handle data skew and increase parallelism, 
> especially in ETL use cases.
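
As a rough sketch of the splitting idea, each TopicPartition's 
[fromOffset, untilOffset) range would be divided into near-equal contiguous 
sub-ranges, one per task. The OffsetRange class below is a simplified 
stand-in for illustration, not Spark's actual internal bookkeeping:

    case class OffsetRange(topic: String, partition: Int, from: Long, until: Long)

    // Split one TopicPartition's offset range into `parts` contiguous
    // sub-ranges of near-equal size, so several tasks can read it in
    // parallel. An empty range still yields a single (empty) sub-range.
    def split(range: OffsetRange, parts: Int): Seq[OffsetRange] = {
      val total = range.until - range.from
      val n = math.max(1, math.min(parts.toLong, total).toInt)
      val step = total / n
      val rem = total % n
      var start = range.from
      (0 until n).map { i =>
        val size = step + (if (i < rem) 1 else 0) // spread the remainder
        val sub = range.copy(from = start, until = start + size)
        start += size
        sub
      }
    }

    // e.g. split(OffsetRange("events", 0, 0L, 1000L), 4)
    // -> four sub-ranges of 250 offsets each, readable by four tasks

The catch, as the description notes, is that each sub-range's task would have 
to seek its own consumer to an arbitrary offset, which defeats the reuse that 
CachedKafkaConsumer exists for.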


