[ https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15706788#comment-15706788 ]
Burak Yavuz commented on SPARK-18475:
-------------------------------------

I'd be happy to share performance results. You're right, I never tried it with SSL on.

One thing to note is that I was never planning to have this enabled by default, because there is no way to pick a sane default parallelism value. What I was hoping to achieve was to provide Spark users, who may not be Kafka experts, a "break in case of emergency" way out.

It's easy to tell people "partition your data properly" until someone upstream in your organization changes one thing and the data engineer has to deal with the mess of skewed data. You may want to tell people "hey, increase your Kafka partitions" if you want more Kafka parallelism, but is that a viable operation when your queues are already backed up and the damage has already been done? Are you going to have them empty the queue, delete the topic, create a topic with an increased number of partitions, and re-consume everything so that it is properly partitioned again? It's easy to talk about what needs to be done and the proper way to do things, until shit hits the fan in production with something that was totally out of your control and you have to clean up the mess.

> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18475
>                 URL: https://issues.apache.org/jira/browse/SPARK-18475
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we
> shouldn't be able to increase parallelism further, i.e. have multiple Spark
> tasks reading from the same Kafka TopicPartition.
> What this will mean is that we won't be able to use the "CachedKafkaConsumer"
> for what it is defined for (being cached) in this use case, but the extra
> overhead is worth handling data skew and increasing parallelism, especially in
> ETL use cases.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
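The description above asks for multiple Spark tasks to read from the same Kafka TopicPartition. One way to sketch that idea is to subdivide a TopicPartition's pending offset range into roughly equal slices, one per task. The helper below is a hypothetical illustration of that splitting step only; the function name and tuple layout are my own, not Spark's actual API:

```python
def split_offset_range(topic, partition, start, end, num_slices):
    """Split one Kafka TopicPartition's offset range [start, end) into
    up to num_slices contiguous sub-ranges, so that several Spark tasks
    could consume the same TopicPartition concurrently.

    Returns a list of (topic, partition, start_offset, end_offset) tuples.
    """
    total = end - start
    if total <= 0 or num_slices <= 1:
        # Nothing to split: return the range as a single slice.
        return [(topic, partition, start, end)]
    slices = []
    for i in range(num_slices):
        # Integer arithmetic keeps the slices contiguous and non-overlapping,
        # with sizes differing by at most one offset.
        s = start + (total * i) // num_slices
        e = start + (total * (i + 1)) // num_slices
        if e > s:  # skip empty slices when num_slices > total
            slices.append((topic, partition, s, e))
    return slices


# A skewed partition with 10 pending offsets split across 3 tasks:
print(split_offset_range("events", 0, 100, 110, 3))
# → [('events', 0, 100, 103), ('events', 0, 103, 106), ('events', 0, 106, 110)]
```

Note the trade-off the issue mentions: each slice would need its own consumer seek, so the per-TopicPartition consumer cache no longer maps one-to-one to tasks, and the extra connection/seek overhead is the price paid for evening out skew.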