Hi Shailesh,
Spark just leverages the Hadoop FileOutputFormat to write out the RDD you
are saving.
This is really a Hadoop OutputFormat limitation: it requires that the
directory it is writing into does not exist. The idea is that a Hadoop job
should not be able to overwrite the results of a previous job, so it
enforces that the output directory must not already exist.
The easiest way to get around this may be to write the results from each
Spark app to a newly named directory, then run a simple script on an
interval to merge the data from the multiple HDFS directories into one.
This HDFS command will let you do something like a directory merge:
hdfs dfs -cat /folderpath/folder* | hdfs dfs -copyFromLocal - /newfolderpath/file
See this StackOverflow discussion for a way to do it using Pig and Bash
scripting also:
https://stackoverflow.com/questions/19979896/combine-map-reduce-output-from-different-folders-into-single-folder
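The newly-named-directory idea can be sketched like this (the class, helper
name, and path format below are just an illustration, not a Spark API; in a
streaming job the timestamp would come from the batch time):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    public class OutputPaths {
        // Build a distinct output directory per batch so saveAsTextFile
        // never points at an existing path. batchTimeMs is milliseconds
        // since the epoch.
        static String uniqueOutputDir(String basePath, long batchTimeMs) {
            String stamp = new SimpleDateFormat("yyyyMMdd-HHmmss")
                    .format(new Date(batchTimeMs));
            return basePath + "/batch-" + stamp + "-" + batchTimeMs;
        }

        public static void main(String[] args) {
            // prints something like /data/suspected/batch-<stamp>-1413931860000
            System.out.println(uniqueOutputDir("/data/suspected", 1413931860000L));
        }
    }

Each batch then lands in its own directory, and the hdfs merge command
above can fold them into one directory on whatever interval you like.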
Sameer F.
Client Services @ Databricks
On Tue, Oct 21, 2014 at 3:51 PM, Shailesh Birari sbir...@wynyardgroup.com
wrote:
Hello,
Spark 1.1.0, Hadoop 2.4.1
I have written a Spark streaming application. And I am getting
FileAlreadyExistsException for rdd.saveAsTextFile(outputFolderPath).
Here is briefly what I am trying to do.
My application creates a text file stream using the Java streaming
context. The input file is on HDFS.

JavaDStream<String> textStream = ssc.textFileStream(InputFile);
Then it compares each line of the input stream with some data and filters
it. The filtered data I am storing in a JavaDStream<String>.
JavaDStream<String> suspectedStream =
        textStream.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String line) throws Exception {
                List<String> filteredList = new ArrayList<String>();
                // doing filter job
                return filteredList;
            }
        });
And this filteredList I am storing in HDFS as:
suspectedStream.foreach(new Function<JavaRDD<String>, Void>() {
    @Override
    public Void call(JavaRDD<String> rdd) throws Exception {
        rdd.saveAsTextFile(outputFolderPath);
        return null;
    }
});
But with this I am receiving an
org.apache.hadoop.mapred.FileAlreadyExistsException.
I tried appending a random number to outputFolderPath and it works, but my
requirement is to collect all the output in one directory.
Can you please suggest a way to get rid of this exception?
Thanks,
Shailesh
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-write-RDD-s-in-same-directory-tp16962.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.