[
https://issues.apache.org/jira/browse/SPARK-1312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tathagata Das updated SPARK-1312:
---------------------------------
Target Version/s: 1.2.0
Assignee: Tathagata Das
> Batch should read based on the batch interval provided in the StreamingContext
> ------------------------------------------------------------------------------
>
> Key: SPARK-1312
> URL: https://issues.apache.org/jira/browse/SPARK-1312
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 0.9.0
> Reporter: Sanjay Awatramani
> Assignee: Tathagata Das
> Priority: Minor
> Labels: sliding, streaming, window
>
> This problem primarily affects sliding window operations in Spark Streaming.
> Consider the following scenario:
> - A DStream is created from any source (I've checked with file and socket).
> - No actions are applied to this DStream.
> - A sliding window operation is applied to this DStream, and an action is
> applied to the sliding window.
> In this case, Spark does not read the input stream at all in batches whose
> time is not a multiple of the sliding interval. Put another way, it
> won't read the input when it doesn't have to apply the window function. This
> happens because all transformations in Spark are lazy.
> How to fix this or work around it (see line 3 of the code):
> JavaStreamingContext stcObj = new JavaStreamingContext(confObj, new Duration(1 * 60 * 1000));
> JavaDStream<String> inputStream = stcObj.textFileStream("/Input");
> inputStream.print(); // This is the workaround: an action on the pre-window stream
> // windowLen and slideInt are given in minutes
> JavaDStream<String> objWindow = inputStream.window(new Duration(windowLen*60*1000), new Duration(slideInt*60*1000));
> objWindow.dstream().saveAsTextFiles("/Output", "");
> The "Window operations" example in the streaming guide implies that Spark
> will read the stream in every batch, which does not happen because of the
> lazy transformations.
> In most cases where a sliding window is used, no actions are taken on the
> pre-window stream, so my expectation was that Streaming would read every
> batch as long as some action is taken on the windowed stream.
> As per Tathagata,
> "Ideally every batch should read based on the batch interval provided in the
> StreamingContext."
> Refer to the original thread at
> http://apache-spark-user-list.1001560.n3.nabble.com/Sliding-Window-operations-do-not-work-as-documented-tp2999.html
> for more details, including Tathagata's conclusion.
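The skipped batches can be illustrated with a small standalone sketch. It does not use Spark and is not how Spark schedules jobs internally; it only models the behavior reported above, assuming that batch times start at zero and that the slide interval is a whole multiple of the batch interval. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindowBatches {
    // Returns the 1-based batch numbers whose end time falls on a slide
    // boundary, i.e. the batches in which the windowed action fires.
    // Assumption: batch times start at zero and slideMs is a multiple of batchMs.
    public static List<Integer> batchesWithWindowOutput(long batchMs, long slideMs, int numBatches) {
        if (batchMs <= 0 || slideMs <= 0 || slideMs % batchMs != 0) {
            throw new IllegalArgumentException("slideMs must be a positive multiple of batchMs");
        }
        List<Integer> result = new ArrayList<>();
        for (int batch = 1; batch <= numBatches; batch++) {
            long batchEndMs = batch * batchMs;
            if (batchEndMs % slideMs == 0) {
                result.add(batch);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        long batchMs = 1 * 60 * 1000; // 1-minute batch interval, as in the report
        long slideMs = 3 * 60 * 1000; // hypothetical 3-minute slide interval
        System.out.println(batchesWithWindowOutput(batchMs, slideMs, 9));
    }
}
```

With a 1-minute batch interval and a 3-minute slide interval, this prints [3, 6, 9]. Per the report, without the print() workaround the input would be read only in those batches, and batches 1, 2, 4, 5, 7 and 8 would be skipped.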