[
https://issues.apache.org/jira/browse/SPARK-11308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-11308:
---------------------------------
Labels: bulk-closed (was: )
> Change spark streaming's job scheduler logic to ensure guaranteed order of
> batch processing
> -------------------------------------------------------------------------------------------
>
> Key: SPARK-11308
> URL: https://issues.apache.org/jira/browse/SPARK-11308
> Project: Spark
> Issue Type: Improvement
> Components: DStreams
> Affects Versions: 1.5.1
> Reporter: Renjie Liu
> Priority: Major
> Labels: bulk-closed
>
> In the current implementation, Spark Streaming uses a thread pool to run the
> jobs generated in each batch interval, and their order is not guaranteed.
> That is, if the jobs generated at time 1 take longer than the batch duration,
> the jobs for time 2 will begin to execute, and the order in which they finish
> is not guaranteed. This is not very useful in practice, since it may require
> much more storage. For example, for a word count in Spark Streaming to be
> accurate, we need to store the records of each batch in the database, rather
> than just the word counts, in order to be idempotent. But if the processing
> order of batches is guaranteed, we only need to store the last update time
> together with each word count to be idempotent. Simply setting the thread
> pool size to 1 may make the system inefficient when there is more than one
> output stream. This feature can be implemented by giving each output stream
> its own thread, so that the jobs of each output stream are executed in order.
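
The proposal above (one thread per output stream, with each stream's jobs run
in submission order) can be sketched roughly as follows. This is a minimal
illustration in plain Python, not Spark's scheduler code: the class
PerStreamScheduler, the function run_demo, and the stream ids "a" and "b" are
all hypothetical names, and time.sleep stands in for real batch processing.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait


class PerStreamScheduler:
    """Hypothetical sketch: one single-threaded executor per output stream."""

    def __init__(self, stream_ids):
        # A pool with max_workers=1 runs its queued jobs one at a time,
        # in the order they were submitted.
        self._executors = {
            sid: ThreadPoolExecutor(max_workers=1) for sid in stream_ids
        }

    def submit(self, stream_id, job, *args):
        # Jobs of different streams run concurrently with each other;
        # jobs of the same stream run strictly in submission order.
        return self._executors[stream_id].submit(job, *args)

    def shutdown(self):
        for executor in self._executors.values():
            executor.shutdown(wait=True)


def run_demo():
    finished = {"a": [], "b": []}
    lock = threading.Lock()

    def job(stream_id, batch_time, seconds):
        time.sleep(seconds)  # stand-in for processing one batch
        with lock:
            finished[stream_id].append(batch_time)

    scheduler = PerStreamScheduler(["a", "b"])
    futures = []
    # Batch 1 is deliberately the slowest, so an unordered pool could
    # finish batches 2 and 3 before it.
    for batch_time, seconds in [(1, 0.05), (2, 0.0), (3, 0.01)]:
        for stream_id in ("a", "b"):
            futures.append(
                scheduler.submit(stream_id, job, stream_id, batch_time, seconds)
            )
    wait(futures)
    scheduler.shutdown()
    return finished
```

In this sketch, each stream records its batches in the order 1, 2, 3 even
though batch 1 takes the longest, while the two streams still make progress
in parallel.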