[
https://issues.apache.org/jira/browse/SPARK-4734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
宿荣全 updated SPARK-4734:
-----------------------
Summary: [Streaming]limit the file Dstream size for each batch (was: limit
the file Dstream size for each batch)
> [Streaming]limit the file Dstream size for each batch
> -----------------------------------------------------
>
> Key: SPARK-4734
> URL: https://issues.apache.org/jira/browse/SPARK-4734
> Project: Spark
> Issue Type: New Feature
> Components: Streaming
> Reporter: 宿荣全
> Priority: Minor
>
> Streaming scans for new files on HDFS and processes them in each batch.
> The current implementation has two problems:
> 1. When a batch segment contains a very large number of files (or the
> total size of those files is very large), the processing time can grow
> very long, possibly exceeding the slideDuration. This eventually delays
> the dispatch of the next batch.
> 2. When the total size of the file DStream in one batch is very large,
> shuffling that DStream data multiplies its memory footprint several
> times over; the application becomes slow or is even killed by the
> operating system.
> So if we set an upper limit on the input data for each batch to control
> the batch processing time, both the dispatch delay and the processing
> delay will be alleviated.
> Modification:
> Add a new parameter "spark.streaming.segmentSizeThreshold" to InputDStream
> (the input data base class). The maximum size of each batch segment is
> taken from this parameter, set either in [spark-defaults.conf] or in the
> source. All concrete subclasses of InputDStream should act on
> segmentSizeThreshold accordingly.
> This patch modifies FileInputDStream: when new files are found, their
> names and sizes are put into a queue, and elements are taken from the
> queue and packaged into a batch whose total size is < segmentSizeThreshold.
> Please see the source for the detailed logic.
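
The queue-and-threshold packaging described above could be sketched roughly
as follows. This is only an illustration of the idea, not the actual patch
code; the function name `take_batch` and the byte values are invented for
the example, and the real change would live in FileInputDStream's Scala code.

```python
from collections import deque

def take_batch(file_queue, size_threshold):
    """Drain files from the front of the queue into one batch whose total
    size stays under size_threshold; remaining files wait for later batches.
    Hypothetical sketch of the segmentSizeThreshold logic, not patch code."""
    batch, total = [], 0
    while file_queue:
        name, size = file_queue[0]
        if batch and total + size > size_threshold:
            break  # defer this file to a later batch
        file_queue.popleft()
        batch.append(name)
        total += size
    return batch

# Example: newly discovered files with sizes, threshold of 100 bytes
q = deque([("a.log", 40), ("b.log", 50), ("c.log", 30), ("d.log", 20)])
print(take_batch(q, 100))  # first batch: ['a.log', 'b.log'] (40 + 50 = 90)
print(take_batch(q, 100))  # next batch: ['c.log', 'd.log']
```

Note one design choice in the sketch: the first file of a batch is always
taken even if it alone exceeds the threshold, so a single oversized file
still makes progress instead of blocking the queue forever.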
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]