[ 
https://issues.apache.org/jira/browse/SPARK-18475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15679472#comment-15679472
 ] 

Cody Koeninger edited comment on SPARK-18475 at 11/19/16 4:02 PM:
------------------------------------------------------------------

Yes, an RDD does have an ordering guarantee: it's an iterator per partition, 
same as Kafka.  Yes, that guarantee is part of the Kafka data model (Burak, if 
you don't believe me, reread 
http://kafka.apache.org/documentation.html#introduction and search for 
"order").  Because the direct stream (and the structured stream, which uses the 
same model) has a 1:1 correspondence between Kafka partitions and Spark 
partitions, that guarantee is preserved.  The existing distortions between the 
Kafka model and the direct stream / structured stream are enough as it is; we 
don't need to add more.
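To make the ordering argument concrete, here is a minimal sketch in plain Python (not the Spark or Kafka APIs; all names are illustrative) of why the 1:1 mapping keeps Kafka's per-partition offset order intact:

```python
# Model a Kafka topic as partition -> ordered list of (offset, record).
# Kafka guarantees ordering only *within* a partition, never across partitions.
kafka_topic = {
    0: [(0, "a0"), (1, "a1"), (2, "a2")],
    1: [(0, "b0"), (1, "b1")],
}

def direct_stream_partitions(topic):
    """1:1 mapping: each Kafka partition becomes exactly one Spark
    partition, consumed as a single iterator, so records are seen in
    offset order within each partition."""
    return {tp: iter(records) for tp, records in topic.items()}

rdd = direct_stream_partitions(kafka_topic)
# Within each Spark partition, records come back in offset order:
print([off for off, _ in rdd[0]])  # [0, 1, 2]
```

Splitting one Kafka partition across several Spark tasks would break exactly this property: each task would see only a slice of the offset sequence.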



> Be able to provide higher parallelization for StructuredStreaming Kafka Source
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-18475
>                 URL: https://issues.apache.org/jira/browse/SPARK-18475
>             Project: Spark
>          Issue Type: Improvement
>          Components: Structured Streaming
>    Affects Versions: 2.0.2, 2.1.0
>            Reporter: Burak Yavuz
>
> Right now the StructuredStreaming Kafka Source creates as many Spark tasks as 
> there are TopicPartitions that we're going to read from Kafka.
> This doesn't work well when we have data skew, and there is no reason why we 
> shouldn't be able to increase parallelism further, i.e. have multiple Spark 
> tasks reading from the same Kafka TopicPartition.
> This means we won't be able to use the "CachedKafkaConsumer" for its intended 
> purpose (being cached) in this use case, but the extra overhead is worth it 
> for handling data skew and increasing parallelism, especially in ETL use 
> cases.
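The splitting the issue proposes can be sketched as follows (plain Python, illustrative names only, not the actual Spark patch): one TopicPartition's pending offset range is divided into contiguous sub-ranges, one per Spark task, trading the 1:1 model's single iterator per partition for extra parallelism:

```python
def split_offset_range(start, end, num_tasks):
    """Divide the offset range [start, end) of a single TopicPartition
    into num_tasks contiguous sub-ranges, one per Spark task. Each task
    seeks to its own start offset, so a consumer cache keyed only by
    TopicPartition can no longer be reused as-is."""
    size = end - start
    base, extra = divmod(size, num_tasks)
    ranges, cursor = [], start
    for i in range(num_tasks):
        step = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((cursor, cursor + step))
        cursor += step
    return ranges

# A skewed partition with 10 pending offsets split across 3 tasks:
print(split_offset_range(100, 110, 3))  # [(100, 104), (104, 107), (107, 110)]
```

Each sub-range is still read in offset order, but ordering across the sub-ranges of one partition is no longer observed by any single task, which is the distortion the comment above objects to.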



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
