[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2018-07-15 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue: https://github.com/apache/spark/pull/19274 ping @lonelytrooper regarding @koeninger's comment. Otherwise, let me propose to close this for now.

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2018-06-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19274 Can one of the admins verify this patch?

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2018-01-18 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19274 Can one of the admins verify this patch?

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2018-01-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19274 Can one of the admins verify this patch?

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2017-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/19274 Can one of the admins verify this patch?

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2017-09-27 Thread koeninger
Github user koeninger commented on the issue: https://github.com/apache/spark/pull/19274 Search Jira and the mailing list; this idea has been brought up multiple times. I don't think breaking fundamental assumptions of Kafka (one consumer thread per group per partition) is a good

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2017-09-27 Thread lonelytrooper
Github user lonelytrooper commented on the issue: https://github.com/apache/spark/pull/19274 Thank you so much for inviting more discussion!

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2017-09-27 Thread lonelytrooper
Github user lonelytrooper commented on the issue: https://github.com/apache/spark/pull/19274 I guessed that. It is true that this feature cannot ensure the ordering of data within one Kafka partition, but quite a few applications (such as log processing) do not need strict order

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2017-09-27 Thread jerryshao
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19274 This is because it is the only way to guarantee that the ordering of data in a Kafka partition is preserved in the corresponding Spark partition. Maybe some other users took this as an assumption when writing their code.

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2017-09-27 Thread lonelytrooper
Github user lonelytrooper commented on the issue: https://github.com/apache/spark/pull/19274 Hi Jerry, thank you so much for the discussion! Actually, we tried 'repartition' before introducing this feature and gave it up for two reasons. First, it leads to a shuffle, which may
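
For reference, a minimal sketch of the repartition approach described above, using the spark-streaming-kafka-0-10 direct stream; the broker address, group id, topic name, partition count, and the ssc StreamingContext are assumptions for illustration only, not values from this PR.

    // Sketch only: spread processing over more tasks than Kafka partitions by
    // repartitioning the direct stream. Every record is shuffled, which is the
    // cost referred to in the comment above.
    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker:9092",               // assumed broker
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "log-consumer"                        // assumed group id
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,                                                // assumed StreamingContext
      PreferConsistent,
      Subscribe[String, String](Seq("logs"), kafkaParams))

    // One Kafka partition still maps to one Spark partition here; repartition
    // then redistributes the records across 64 tasks at the price of a shuffle.
    val widened = stream.repartition(64)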

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2017-09-26 Thread jerryshao
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19274 Yes, I understand your scenario, but my concern is that your proposal is quite scenario-specific; it may serve your use case well, but it somewhat breaks the design purpose of KafkaRDD. From my

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2017-09-26 Thread lonelytrooper
Github user lonelytrooper commented on the issue: https://github.com/apache/spark/pull/19274 Will more executors be used in the RDD#mapPartitions approach? I'll try that later to see if it works. I think if Spark provided a convenient way to do this, it would help

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2017-09-26 Thread jerryshao
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19274 Hi @loneknightpy, thinking a bit about your PR, I think this can also be done on the user side. Users could create several threads in one task (RDD#mapPartitions) to consume the records concurrently,
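
A minimal sketch of what that user-side approach could look like: keep the one-to-one Kafka-to-Spark partition mapping, but process each partition's records with a small thread pool inside the task. The pool size and the handle function are assumptions for illustration, not code from this PR.

    import java.util.concurrent.{Callable, Executors, TimeUnit}
    import scala.collection.JavaConverters._
    import org.apache.kafka.clients.consumer.ConsumerRecord

    // Hypothetical per-record work; stands in for whatever the application does.
    def handle(record: ConsumerRecord[String, String]): Unit = ()

    def processConcurrently(iter: Iterator[ConsumerRecord[String, String]]): Iterator[Long] = {
      val pool = Executors.newFixedThreadPool(4)          // per-task concurrency, assumed
      try {
        // invokeAll needs a materialized collection, so the partition's records
        // are buffered here; ordering within the Kafka partition is not preserved.
        val tasks = iter.map { record =>
          new Callable[Unit] { override def call(): Unit = handle(record) }
        }.toSeq.asJava
        val done = pool.invokeAll(tasks)
        Iterator.single(done.size().toLong)
      } finally {
        pool.shutdown()
        pool.awaitTermination(1, TimeUnit.MINUTES)
      }
    }

    // Usage inside a streaming job:
    // stream.foreachRDD { rdd => rdd.mapPartitions(processConcurrently).count() }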

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2017-09-21 Thread lonelytrooper
Github user lonelytrooper commented on the issue: https://github.com/apache/spark/pull/19274 Yes. One Kafka partition will map to many Spark partitions, thus more executors can be used.
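
A hypothetical illustration of what "one Kafka partition maps to many Spark partitions" could mean: split a single OffsetRange into contiguous sub-ranges, each of which would back its own Spark partition. This is a sketch of the idea only, not the actual code in this PR.

    import org.apache.spark.streaming.kafka010.OffsetRange

    // Split [fromOffset, untilOffset) into at most subConcurrency contiguous pieces.
    def split(range: OffsetRange, subConcurrency: Int): Seq[OffsetRange] = {
      val total = range.untilOffset - range.fromOffset
      val step  = math.max(1L, math.ceil(total.toDouble / subConcurrency).toLong)
      (range.fromOffset until range.untilOffset by step).map { from =>
        OffsetRange.create(range.topic, range.partition, from,
          math.min(from + step, range.untilOffset))
      }
    }

    // e.g. split(OffsetRange.create("logs", 0, 0L, 1000L), 4)
    //      => sub-ranges [0,250), [250,500), [500,750), [750,1000)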

[GitHub] spark issue #19274: [SPARK-22056][Streaming] Add subconcurrency for KafkaRDD...

2017-09-21 Thread jerryshao
Github user jerryshao commented on the issue: https://github.com/apache/spark/pull/19274 Will this break the assumption that one Kafka partition will map to one Spark partition?