[ https://issues.apache.org/jira/browse/SPARK-11308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Renjie Liu updated SPARK-11308:
-------------------------------
    Description: 
In the current implementation, Spark Streaming uses a thread pool to run the jobs generated in each batch interval, and execution order is not guaranteed: if the jobs generated at time 1 take longer than the batch duration, the jobs for time 2 begin executing anyway, and the order in which they finish is not guaranteed. This behavior is not very useful in practice because it can require much more storage. For example, for a word count in Spark Streaming to be accurate and idempotent, we need to store the records of each batch in the database rather than just the word counts. If the processing order of batches were guaranteed, we would only need to store the last update time alongside each word count to be idempotent. Simply setting the thread pool size to 1 may make the system inefficient when there is more than one output stream. This feature can be implemented by giving each output stream its own thread, so that the jobs of each output stream are executed in order.

  (was: In the current implementation, Spark Streaming uses a thread pool to run the jobs generated in each batch interval, and execution order is not guaranteed: if the jobs generated at time 1 take longer than the batch duration, the jobs for time 2 begin executing anyway, and the order in which they finish is not guaranteed. This behavior is not very useful in practice because it can require much more storage. For example, for a word count in Spark Streaming to be accurate, we need to store the records of each batch in the database rather than just the word counts. If the processing order of batches were guaranteed, we would only need to store the last update time alongside each word count to be idempotent. Simply setting the thread pool size to 1 may make the system inefficient when there is more than one output stream. This feature can be implemented by giving each output stream its own thread, so that the jobs of each output stream are executed in order.)
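To illustrate the storage argument in the description, here is a minimal sketch of an idempotent word-count update that keeps only a running count and a last update time per word. It is not Spark code: `WordCountStore`, `apply_batch`, and the in-memory dictionary standing in for the database are hypothetical names chosen for this example, and batch times are assumed to be increasing integers.

```python
from typing import Dict, Tuple


class WordCountStore:
    """In-memory stand-in for a database table: word -> (count, last_update_time)."""

    def __init__(self) -> None:
        self.rows: Dict[str, Tuple[int, int]] = {}

    def apply_batch(self, batch_time: int, counts: Dict[str, int]) -> None:
        """Add one batch's counts, skipping words already updated at or after batch_time."""
        for word, delta in counts.items():
            total, last_update = self.rows.get(word, (0, -1))
            if batch_time <= last_update:
                # This batch (or a later one) was already applied to this word,
                # so a retry must not add the counts a second time.
                continue
            self.rows[word] = (total + delta, batch_time)


store = WordCountStore()
store.apply_batch(1, {"spark": 2, "jira": 1})
store.apply_batch(2, {"spark": 3})
store.apply_batch(2, {"spark": 3})  # retried batch: ignored
print(store.rows)
```

The timestamp check is only correct when batches are applied in order. If batch 2 were applied before batch 1, the counts for "spark" in batch 1 would be discarded as if they were a replay, which is why the description ties this scheme to guaranteed batch order.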
> Change Spark Streaming's job scheduler logic to ensure guaranteed order of batch processing
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-11308
>                 URL: https://issues.apache.org/jira/browse/SPARK-11308
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.5.1
>            Reporter: Renjie Liu
>            Priority: Minor
>
> In the current implementation, Spark Streaming uses a thread pool to run the
> jobs generated in each batch interval, and execution order is not guaranteed:
> if the jobs generated at time 1 take longer than the batch duration, the jobs
> for time 2 begin executing anyway, and the order in which they finish is not
> guaranteed. This behavior is not very useful in practice because it can
> require much more storage. For example, for a word count in Spark Streaming
> to be accurate and idempotent, we need to store the records of each batch in
> the database rather than just the word counts. If the processing order of
> batches were guaranteed, we would only need to store the last update time
> alongside each word count to be idempotent. Simply setting the thread pool
> size to 1 may make the system inefficient when there is more than one output
> stream. This feature can be implemented by giving each output stream its own
> thread, so that the jobs of each output stream are executed in order.
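The proposal to give each output stream its own thread could look roughly like the following sketch. It is an illustration in plain Python, not the Spark Streaming scheduler: `PerStreamScheduler`, `make_job`, and the stream identifiers are hypothetical, and `submit` is assumed to be called from a single job-generating thread.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


class PerStreamScheduler:
    """Runs jobs of one output stream in submission order; streams run concurrently."""

    def __init__(self) -> None:
        self._executors: Dict[str, ThreadPoolExecutor] = {}

    def submit(self, stream_id: str, job: Callable[[], None]) -> Future:
        executor = self._executors.get(stream_id)
        if executor is None:
            # One worker per stream: its queue is drained in FIFO order.
            executor = ThreadPoolExecutor(max_workers=1)
            self._executors[stream_id] = executor
        return executor.submit(job)

    def shutdown(self) -> None:
        for executor in self._executors.values():
            executor.shutdown(wait=True)


finished: List[Tuple[str, int]] = []


def make_job(stream_id: str, batch_time: int, seconds: float) -> Callable[[], None]:
    def job() -> None:
        time.sleep(seconds)  # simulate processing time
        finished.append((stream_id, batch_time))
    return job


scheduler = PerStreamScheduler()
scheduler.submit("a", make_job("a", 1, 0.05))  # slow batch 1 on stream "a"
scheduler.submit("a", make_job("a", 2, 0.0))   # must wait for batch 1
scheduler.submit("b", make_job("b", 1, 0.0))   # independent stream, not blocked
scheduler.shutdown()
print(finished)
```

Within stream "a", batch 2 cannot start until batch 1 has finished, even though batch 1 is slower. Stream "b" has its own worker, so a slow batch on "a" does not hold it up, which is the inefficiency the description attributes to a single shared thread.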