Spark Streaming v 1.6.2 Kafka v0.10.1 I am reading msgs from Kafka. What surprised me is the following DStream only process the first batch.
KafkaUtils.createDirectStream[ String, String, StringDecoder, StringDecoder](streamingContext, kafkaParams, Set(topic)) .map(_._2) .window(Seconds(windowLengthInSec)) Some logs as below are endlessly repeated: 16/11/22 14:20:40 INFO MappedDStream: Slicing from 1479820835000 ms to 1479820840000 ms (aligned to 1479820835000 ms and 1479820840000 ms) 16/11/22 14:20:40 INFO JobScheduler: Added jobs for time 1479820840000 ms And the action on the DStream is just a rdd count windowedStream foreachRDD { rdd => rdd.count } >From the webUI, only the first batch is in status: Processing, the others are all Queued. However, if I permute map and window operation, everything is ok. KafkaUtils.createDirectStream[ String, String, StringDecoder, StringDecoder](streamingContext, kafkaParams, Set(topic)) .window(Seconds(windowLengthInSec)) .map(_._2) I think the two are equivalent. But they are not. Furthermore, if I replace my KafkaDStream with a QueueStream, it works for no matter which order of map and window operation. I am not sure whether this is related with KafkaDStream or just DStream. Any help is appreciated. -- Hao Ren Data Engineer @ leboncoin Paris, France