[
https://issues.apache.org/jira/browse/SPARK-7122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14568854#comment-14568854
]
Nicolas PHUNG commented on SPARK-7122:
--------------------------------------
For _KafkaUtils.createStream_, jobs take between 13 ms and 0.3 s. In detail,
stages take between 13 ms and 0.3 s and are split into 1 to 3 tasks. From the
streaming tab in the Spark UI, the processing time is 112 ms at the 75th
percentile and 358 ms at the maximum.
For _KafkaUtils.createDirectStream_, jobs take between 13 ms and 7 s. In
detail, stages take between 13 ms and 7 s and are split into 275 to 400 tasks.
My Kafka topic has 400 partitions, which can explain the task split in
_KafkaUtils.createDirectStream_. But I don't understand why it falls behind,
whereas _KafkaUtils.createStream_ can keep up with the same _foreachRDD_
processing (that is, reprocessing everything from the beginning and then
keeping up with newer/recent events in Kafka). Of course, I'm using the same
Spark executor configuration (cores/RAM) for both. Or maybe I'm doing
something wrong somewhere.
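For reference, a minimal sketch of the two stream setups being compared (the
topic name, ZooKeeper/broker addresses, consumer group, batch interval and
receiver count are placeholders, not the actual job's values):
{code}
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-compare"), Seconds(5))

// Receiver-based stream: tasks per batch are governed by the receiver's
// block interval, not by the number of Kafka partitions.
val receiverStream = KafkaUtils.createStream(
  ssc, "zookeeper:2181", "my-consumer-group", Map("my-topic" -> 1))

// Direct stream: one RDD partition (hence one task) per Kafka topic
// partition, so a 400-partition topic yields ~400 tasks per batch,
// even when the batches are almost empty.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  Map("metadata.broker.list" -> "broker:9092",
      "auto.offset.reset" -> "smallest"), // reprocess from the beginning
  Set("my-topic"))
{code}
One thing I may still try is repartitioning the direct stream to fewer
partitions before the _foreachRDD_, to bring its per-batch task count (and
scheduling overhead) closer to the receiver-based case.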
> KafkaUtils.createDirectStream - unreasonable processing time in absence of
> load
> -------------------------------------------------------------------------------
>
> Key: SPARK-7122
> URL: https://issues.apache.org/jira/browse/SPARK-7122
> Project: Spark
> Issue Type: Question
> Components: Streaming
> Affects Versions: 1.3.1
> Environment: Spark Streaming 1.3.1, standalone mode running on just 1
> box: Ubuntu 14.04.2 LTS, 4 cores, 8GB RAM, java version "1.8.0_40"
> Reporter: Platon Potapov
> Priority: Minor
> Attachments: 10.second.window.fast.job.txt,
> 5.second.window.slow.job.txt, SparkStreamingJob.scala
>
>
> Attached is the complete source code of a test Spark job. No external data
> generators are run - just the presence of a Kafka topic named "raw" suffices.
> The Spark job is run with no load whatsoever. http://localhost:4040/streaming
> is checked to obtain the job processing duration.
> * in case the test contains the following transformation:
> {code}
> // dummy transformation
> val temperature = bytes.filter(_._1 == "abc")
> val abc = temperature.window(Seconds(40), Seconds(5))
> abc.print()
> {code}
> the median processing time is 3 seconds 80 ms
> * in case the test contains the following transformation:
> {code}
> // dummy transformation
> val temperature = bytes.filter(_._1 == "abc")
> val abc = temperature.map(x => (1, x))
> abc.print()
> {code}
> the median processing time is just 50 ms
> Please explain: why does the "window" transformation introduce such a growth
> in job duration?
> Note: the result is the same regardless of the number of Kafka topic
> partitions (I've tried 1 and 8).
> Note 2: the result is the same regardless of the window parameters (I've
> tried (20, 2) and (40, 5)).
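> As a rough illustration (my reading of the documented _DStream.window_
> semantics; the partition count below is just one of the values tried, and
> this arithmetic alone does not explain why the measured time is independent
> of it):
> {code}
> // Illustrative only: each windowed batch unions the RDDs of the last
> // windowDuration / slideDuration parent batches, and a direct stream
> // keeps one partition per Kafka partition even for empty batches.
> val windowDuration  = 40  // seconds
> val slideDuration   = 5   // seconds
> val kafkaPartitions = 8   // one of the tried values (1 or 8)
>
> val batchesPerWindow  = windowDuration / slideDuration       // 8
> val tasksPerWindowJob = batchesPerWindow * kafkaPartitions   // 64, vs. 8 for the map variant
> {code}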