Spark Streaming - How to write RDD's in same directory ?

2014-10-21 Thread Shailesh Birari
Hello,

Spark 1.1.0, Hadoop 2.4.1

I have written a Spark Streaming application, and I am getting a
FileAlreadyExistsException from rdd.saveAsTextFile(outputFolderPath).
Here is briefly what I am trying to do.
My application creates a text file stream using the Java streaming context.
The input file is on HDFS.

JavaDStream<String> textStream = ssc.textFileStream(InputFile);

Then it compares each line of the input stream with some data and filters
it. The filtered data I am storing in a JavaDStream<String>.

        JavaDStream<String> suspectedStream = textStream.flatMap(
                new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String line) throws Exception {

                List<String> filteredList = new ArrayList<String>();

                // doing filter job

                return filteredList;
            }
        });

This filteredList I am then storing to HDFS as follows:

        suspectedStream.foreach(new Function<JavaRDD<String>, Void>() {
            @Override
            public Void call(JavaRDD<String> rdd) throws Exception {
                rdd.saveAsTextFile(outputFolderPath);
                return null;
            }
        });


But with this I am receiving an
org.apache.hadoop.mapred.FileAlreadyExistsException.

I tried appending a random number to outputFolderPath and that works,
but my requirement is to collect all output in one directory.

Can you please suggest a way to get rid of this exception?

Thanks,
  Shailesh







Re: Spark Streaming - How to write RDD's in same directory ?

2014-10-21 Thread Sameer Farooqui
Hi Shailesh,

Spark just leverages the Hadoop FileOutputFormat to write out the RDD you
are saving.

This is really a Hadoop OutputFormat limitation: it requires that the
directory it is writing into does not already exist. The idea is that a
Hadoop job should not be able to overwrite the results of a previous job,
so it enforces that the output dir must not exist beforehand.

The easiest way to get around this may be to write the results of each
batch to a newly named directory, and then on an interval run a simple
script that merges the data from the multiple HDFS directories into one
directory.
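
As a rough, untested sketch of the first part (reusing your suspectedStream
and outputFolderPath, and assuming one output directory per micro-batch is
acceptable), you could append the batch time to the path inside foreachRDD:

    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.function.Function2;
    import org.apache.spark.streaming.Time;

    // Untested sketch: give every micro-batch its own output directory by
    // appending the batch time to the base path. "suspectedStream" and
    // "outputFolderPath" are the ones from your code above.
    suspectedStream.foreachRDD(new Function2<JavaRDD<String>, Time, Void>() {
        @Override
        public Void call(JavaRDD<String> rdd, Time time) throws Exception {
            // Each batch writes to a directory that does not exist yet, so
            // the Hadoop OutputFormat check passes every time.
            rdd.saveAsTextFile(outputFolderPath + "-" + time.milliseconds());
            return null;
        }
    });

Since no batch ever writes into an existing directory, the exception goes
away, and your merge script can later collapse those per-batch directories
into the single directory you need.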

This HDFS command will let you do something like a directory merge:
hdfs dfs -cat /folderpath/folder* | hdfs dfs -copyFromLocal -
/newfolderpath/file

See this StackOverflow discussion for a way to do it using Pig and Bash
scripting also:
https://stackoverflow.com/questions/19979896/combine-map-reduce-output-from-different-folders-into-single-folder


Sameer F.
Client Services @ Databricks




Re: Spark Streaming - How to write RDD's in same directory ?

2014-10-21 Thread Shailesh Birari
Thanks, Sameer, for the quick reply.

I will try to implement it.

  Shailesh



