宿荣全 created SPARK-4734:
--------------------------
Summary: limit the file Dstream size for each batch
Key: SPARK-4734
URL: https://issues.apache.org/jira/browse/SPARK-4734
Project: Spark
Issue Type: New Feature
Components: Streaming
Reporter: 宿荣全
Priority: Minor
Spark Streaming scans HDFS for new files and processes them in each batch. The
current implementation has two problems:
1. When the number of new files (and their total size) in one batch segment is
very large, the processing time becomes very long and may exceed the
slideDuration, so the dispatch of the next batch is delayed.
2. When the total size of the file DStream in one batch is very large,
shuffling that data multiplies its memory footprint several times; the
application slows down or is even terminated by the operating system.
So if we set an upper limit on the input data size for each batch to control
the batch processing time, both the job dispatch delay and the processing
delay will be alleviated.
Modification:
Add a new parameter, "spark.streaming.segmentSizeThreshold", to InputDStream
(the base class for input data). The maximum size of each batch segment is
taken from this parameter, which can be set in
[spark-defaults.conf.template] or in source code. All subclasses of
InputDStream can then take the corresponding action based on
segmentSizeThreshold.
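For example (a sketch only: "spark.streaming.segmentSizeThreshold" is the
parameter proposed by this issue, not an existing Spark configuration, and the
128 MB value is illustrative), the threshold could be set in the defaults
file:
{code}
# in conf/spark-defaults.conf: cap each batch at ~128 MB of new file data
spark.streaming.segmentSizeThreshold  134217728
{code}
or programmatically:
{code:scala}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("FileStreamApp")
  // proposed parameter from this issue; value in bytes (illustrative)
  .set("spark.streaming.segmentSizeThreshold", (128L * 1024 * 1024).toString)
{code}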
This patch modifies FileInputDStream: when new files are found, their names
and sizes are put into a queue, and elements are dequeued and packaged into
one batch whose total size is < segmentSizeThreshold. Please see the patch
source for the detailed logic.
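The packaging logic could look roughly like the sketch below (hypothetical
names such as SegmentBatcher and nextBatchFiles are for illustration only;
the oversized-file fallback is my assumption, not necessarily what the patch
does):
{code:scala}
import scala.collection.mutable

// Sketch of queue-based batching: files found by FileInputDStream are
// queued with their sizes and dequeued into batches bounded by the
// proposed segmentSizeThreshold.
object SegmentBatcher {
  // (path, size in bytes) of files discovered but not yet batched
  private val pendingFiles = new mutable.Queue[(String, Long)]()

  def enqueue(path: String, size: Long): Unit =
    pendingFiles.enqueue((path, size))

  /** Dequeue files while the running total stays under the threshold. */
  def nextBatchFiles(segmentSizeThreshold: Long): Seq[String] = {
    val batch = mutable.ArrayBuffer[String]()
    var total = 0L
    while (pendingFiles.nonEmpty &&
           total + pendingFiles.head._2 < segmentSizeThreshold) {
      val (path, size) = pendingFiles.dequeue()
      batch += path
      total += size
    }
    // assumption: take one oversized file anyway so the stream makes progress
    if (batch.isEmpty && pendingFiles.nonEmpty) {
      batch += pendingFiles.dequeue()._1
    }
    batch.toSeq
  }
}
{code}
Files left in the queue are simply picked up by later batches, so each
batch's input size stays bounded while no discovered file is ever dropped.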
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]