[
https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated SPARK-7441:
-----------------------------
Target Version/s: (was: 1.6.0)
> Implement microbatch functionality so that Spark Streaming can process a
> large backlog of existing files discovered in batch in smaller batches
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-7441
> URL: https://issues.apache.org/jira/browse/SPARK-7441
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Reporter: Emre Sevinç
> Labels: performance
>
> Implement microbatch functionality so that Spark Streaming can process a huge
> backlog of existing files, discovered in a single batch, in smaller batches.
> Spark Streaming can process files that already exist in a directory, and,
> depending on the value of {{spark.streaming.minRememberDuration}} (60
> seconds by default; see SPARK-3276 for details), this can mean that a
> Spark Streaming application receives thousands, or even hundreds of
> thousands, of files within the first batch interval. This, in turn, leads to
> a 'flooding' effect for the streaming application, which has to deal with a
> huge number of existing files in a single batch interval.
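> For illustration only, a minimal configuration sketch of the pieces involved
> (the app name, master, batch interval, and input directory are placeholder
> values; the exact accepted key for the remember duration may vary by Spark
> version):
> {code:scala}
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
>
> // Files whose modification time falls within the "remember" window of the
> // first batch are treated as new, so a large pre-existing backlog can all
> // land in the very first batch interval.
> val conf = new SparkConf()
>   .setAppName("FileBacklogExample")                   // placeholder app name
>   .setMaster("local[2]")                              // placeholder master
>   .set("spark.streaming.minRememberDuration", "60s")  // default per SPARK-3276
>
> val ssc = new StreamingContext(conf, Seconds(30))
> val lines = ssc.textFileStream("/data/incoming")      // placeholder directory
> {code}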
> We propose a very simple change to
> {{org.apache.spark.streaming.dstream.FileInputDStream}} so that, based on a
> configuration property such as {{spark.streaming.microbatch.size}}, it either
> keeps its default behavior when {{spark.streaming.microbatch.size}} has the
> default value of {{0}} (i.e. it processes as many new files as have been
> discovered in the current batch interval), or processes new files in groups
> of {{spark.streaming.microbatch.size}} (e.g. in groups of 100).
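> To make the intent concrete, here is a minimal, hypothetical sketch of the
> grouping logic (not the actual patch; {{selectFilesForBatch}} is an
> illustrative helper name):
> {code:scala}
> // Sketch only: cap how many newly discovered files a single batch receives,
> // deferring the remainder to subsequent batch intervals.
> object MicrobatchSketch {
>   def selectFilesForBatch(
>       newFiles: Seq[String],
>       microbatchSize: Int): (Seq[String], Seq[String]) = {
>     if (microbatchSize <= 0) {
>       // Default behavior: hand every newly discovered file to this batch.
>       (newFiles, Seq.empty)
>     } else {
>       // Process only the first microbatchSize files now; keep the rest
>       // pending so they are picked up in later batch intervals.
>       newFiles.splitAt(microbatchSize)
>     }
>   }
> }
>
> // Example: with spark.streaming.microbatch.size = 100, a backlog of 100,000
> // files would be drained over roughly 1,000 batch intervals.
> {code}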
> We have tested this patch at one of our customers, and it has been running
> successfully for weeks (for example, there were cases where our Spark
> Streaming application was stopped, tens of thousands of files were created in
> a directory in the meantime, and the application had to process those
> existing files once it was restarted).