[
https://issues.apache.org/jira/browse/SPARK-7441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen updated SPARK-7441:
-----------------------------
Target Version/s: (was: 1.6.0)
> Implement microbatch functionality so that Spark Streaming can process a
> large backlog of existing files discovered in batch in smaller batches
> -----------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SPARK-7441
> URL: https://issues.apache.org/jira/browse/SPARK-7441
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Reporter: Emre Sevinç
> Labels: performance
>
> Implement microbatch functionality so that Spark Streaming can process a huge
> backlog of existing files, discovered in a single batch, in smaller batches.
> Spark Streaming can process files that already exist in a directory, and,
> depending on the value of {{spark.streaming.minRememberDuration}} (60
> seconds by default; see SPARK-3276 for details), this can mean that a
> Spark Streaming application receives thousands, or even hundreds of
> thousands, of files within the first batch interval. This, in turn, leads to
> a 'flooding' effect for the streaming application, which has to deal with a
> huge number of existing files in a single batch interval.
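> For illustration only, a minimal configuration sketch of the pieces involved
> (the app name, master, batch interval, and input directory are placeholder
> values; the exact accepted key for the remember duration may vary by Spark
> version):
> {code:scala}
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
>
> // Files whose modification time falls within the "remember" window of the
> // first batch are treated as new, so a large pre-existing backlog can all
> // land in the very first batch interval.
> val conf = new SparkConf()
>   .setAppName("FileBacklogExample")                   // placeholder app name
>   .setMaster("local[2]")                              // placeholder master
>   .set("spark.streaming.minRememberDuration", "60s")  // default per SPARK-3276
>
> val ssc = new StreamingContext(conf, Seconds(30))
> val lines = ssc.textFileStream("/data/incoming")      // placeholder directory
> {code}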
> We propose a very simple change to
> {{org.apache.spark.streaming.dstream.FileInputDStream}} so that, based on a
> configuration property such as {{spark.streaming.microbatch.size}}, it either
> keeps its default behavior when {{spark.streaming.microbatch.size}} has the
> default value of {{0}} (i.e. it processes as many new files as have been
> discovered in the current batch interval), or processes new files in groups
> of {{spark.streaming.microbatch.size}} (e.g. in groups of 100).
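> To make the intent concrete, here is a minimal, hypothetical sketch of the
> grouping logic (not the actual patch; {{selectFilesForBatch}} is an
> illustrative helper name):
> {code:scala}
> // Sketch only: cap how many newly discovered files a single batch receives,
> // deferring the remainder to subsequent batch intervals.
> object MicrobatchSketch {
>   def selectFilesForBatch(
>       newFiles: Seq[String],
>       microbatchSize: Int): (Seq[String], Seq[String]) = {
>     if (microbatchSize <= 0) {
>       // Default behavior: hand every newly discovered file to this batch.
>       (newFiles, Seq.empty)
>     } else {
>       // Process only the first microbatchSize files now; keep the rest
>       // pending so they are picked up in later batch intervals.
>       newFiles.splitAt(microbatchSize)
>     }
>   }
> }
>
> // Example: with spark.streaming.microbatch.size = 100, a backlog of 100,000
> // files would be drained over roughly 1,000 batch intervals.
> {code}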
> We have tested this patch at one of our customers, and it has been running
> successfully for weeks (for example, there were cases where our Spark
> Streaming application was stopped, tens of thousands of files were created in
> a directory in the meantime, and the application had to process those
> existing files once it was restarted).