[ https://issues.apache.org/jira/browse/SPARK-3553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232801#comment-14232801 ]
Micael Capitão commented on SPARK-3553:
---------------------------------------
I can confirm the weird behaviour when running on HDFS too.
I have a Spark Streaming app with a fileStream on the directory
"hdfs:///user/altaia/cdrs/stream".
Initially, these files are present:
[1] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_6_06_20.txt.gz
[2] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_7_11_01.txt.gz
[3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz
[4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz
[5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz
[6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz
When I start the application, they are processed. When I add a new file [7] by
renaming it to end with .gz, it is processed too.
[7]
hdfs://blade2.ct.ptin.corppt.com:8020/user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_36_34.txt.gz
But right after [7], Spark Streaming reprocesses some of the initially present
files:
[3] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_8_41_01.txt.gz
[4] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_06_58.txt.gz
[5] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_41_01.txt.gz
[6] hdfs:///user/altaia/cdrs/stream/Terminais_3G_VOZ_14_07_2013_9_57_13.txt.gz
Nothing else is repeated in the next batches. But when I add yet another file,
it is not detected, and the stream stays like that.
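
For reference, my stream is created roughly like this (a minimal sketch: the
app name, batch duration, pass-through filter and newFilesOnly = false are
illustrative guesses, consistent with the initially present files being
processed at startup; only the directory is my real one):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// App name and batch duration are illustrative, not my exact values
val ssc = new StreamingContext(
  new SparkConf().setAppName("cdr-stream"), Seconds(10))
// newFilesOnly = false: files already in the directory are processed at startup
val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
  "hdfs:///user/altaia/cdrs/stream", (f: Path) => true, newFilesOnly = false)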
> Spark Streaming app streams files that have already been streamed in an
> endless loop
> ------------------------------------------------------------------------------------
>
> Key: SPARK-3553
> URL: https://issues.apache.org/jira/browse/SPARK-3553
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.0.1
> Environment: EC2 cluster - YARN
> Reporter: Ezequiel Bella
> Labels: S3, Streaming, YARN
>
> We have a Spark Streaming app deployed in a YARN EC2 cluster with 1 name node
> and 2 data nodes. We submit the app with 11 executors, each with 1 core and
> 588 MB of RAM.
> The app streams from a directory in S3 which is constantly being written to;
> this is the line of code that achieves that:
> val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
>   Settings.S3RequestsHost, (f: Path) => true, newFilesOnly = true)
> The purpose of using fileStream instead of textFileStream is to customize the
> way Spark handles existing files when the process starts: we want to process
> just the new files that are added after the process launches and skip the
> ones already there. We configured a batch duration of 10 seconds.
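> Roughly, the full setup looks like this (a minimal sketch: the app name and
> the print() action are illustrative; the fileStream call, the filter, the
> newFilesOnly flag and the 10-second batch duration are the ones described
> above):
>
> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.io.{LongWritable, Text}
> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
>
> val conf = new SparkConf().setAppName("s3-requests-stream") // name illustrative
> val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batches
> // newFilesOnly = true: process only files added after the stream starts
> val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
>   Settings.S3RequestsHost, (f: Path) => true, newFilesOnly = true)
> lines.map(_._2.toString).print() // illustrative output action
> ssc.start()
> ssc.awaitTermination()
>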
> The process goes fine while we add a small number of files to S3, say 4 or 5.
> We can see in the streaming UI how the stages execute successfully on the
> executors, one for each file that is processed. But when we add a larger
> number of files, we face a strange behavior: the application starts streaming
> files that have already been streamed.
> For example, I add 20 files to S3. The files are processed in 3 batches: the
> first batch processes 7 files, the second 8, and the third 5. No more files
> are added to S3 at this point, but Spark starts repeating these batches
> endlessly with the same files.
> Any thoughts on what could be causing this?
> Regards,
> Easyb