[Spark Streaming] map and window operation on DStream only process one batch

Hao Ren Tue, 22 Nov 2016 05:49:14 -0800

Spark Streaming v 1.6.2
Kafka v0.10.1

I am reading msgs from Kafka.
What surprised me is the following DStream only process the first batch.


KafkaUtils.createDirectStream[
  String,
  String,
  StringDecoder,
  StringDecoder](streamingContext, kafkaParams, Set(topic))
  .map(_._2)
  .window(Seconds(windowLengthInSec))

Some logs as below are endlessly repeated:

16/11/22 14:20:40 INFO MappedDStream: Slicing from 1479820835000 ms to
1479820840000 ms (aligned to 1479820835000 ms and 1479820840000 ms)
16/11/22 14:20:40 INFO JobScheduler: Added jobs for time 1479820840000 ms

And the action on the DStream is just a rdd count

windowedStream foreachRDD { rdd => rdd.count }

>From the webUI, only the first batch is in status: Processing, the others
are all Queued.
However, if I permute map and window operation, everything is ok.

KafkaUtils.createDirectStream[
  String,
  String,
  StringDecoder,
  StringDecoder](streamingContext, kafkaParams, Set(topic))
  .window(Seconds(windowLengthInSec))
  .map(_._2)

I think the two are equivalent. But they are not.

Furthermore, if I replace my KafkaDStream with a QueueStream, it works for
no matter which order of map and window operation.

I am not sure whether this is related with KafkaDStream or just DStream.

Any help is appreciated.

-- 
Hao Ren

Data Engineer @ leboncoin

Paris, France

[Spark Streaming] map and window operation on DStream only process one batch

Reply via email to