Re: Sorted Multiple Outputs

2015-08-14 Thread Yiannis Gkoufas
Hi Eugene, in my case the list of values that I want to sort and write to a separate file is fairly small, so the way I solved it is the following:

.groupByKey().foreach(e => {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig);
  val newPath = rootPath+"/"+e._1;
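A fuller sketch of the approach this snippet starts, under the same assumption that each key's value list is small enough to sort in executor memory. The RDD name `pairs`, the type `RDD[(String, String)]`, and `rootPath` are hypothetical placeholders, not names from the thread:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val rootPath = "/output/by-key" // hypothetical output directory

// pairs: RDD[(String, String)] -- assumed shape for illustration
pairs.groupByKey().foreach { case (key, values) =>
  val hdfs = FileSystem.get(new Configuration())
  val out = hdfs.create(new Path(rootPath + "/" + key))
  try {
    // the value list is small, so sorting in memory on the executor is fine
    values.toSeq.sorted.foreach(v => out.writeBytes(v + "\n"))
  } finally {
    out.close()
  }
}
```

Note this opens one HDFS stream per key from inside `foreach`, which is only reasonable when the number of keys per executor is modest.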

Re: Sorted Multiple Outputs

2015-08-12 Thread Eugene Morozov
Yiannis, sorry for the late response. It is indeed not possible to create a new RDD inside of foreachPartition, so you have to write the data manually. I haven’t tried that and haven’t got such an exception, but I’d assume you might try to write locally and then upload it into HDFS. FileSystem has a s
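A minimal sketch of the "write locally, then upload" idea suggested here, using Hadoop's `FileSystem.copyFromLocalFile`. The helper name, the `hdfsDir` parameter, and the assumption that values are strings are all hypothetical:

```scala
import java.io.{File, PrintWriter}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def writeLocallyThenUpload(key: String, values: Iterable[String], hdfsDir: String): Unit = {
  // 1. Write the values to a temp file on the executor's local disk.
  val local = File.createTempFile(key, ".txt")
  val writer = new PrintWriter(local)
  try values.foreach(writer.println) finally writer.close()

  // 2. Copy the local file into HDFS; delSrc = true deletes the local temp file.
  val fs = FileSystem.get(new Configuration())
  fs.copyFromLocalFile(true, new Path(local.getAbsolutePath), new Path(hdfsDir + "/" + key))
}
```

This keeps only one small file on local disk at a time, which matters if executors have limited scratch space.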

Re: Sorted Multiple Outputs

2015-07-16 Thread Yiannis Gkoufas
Hi Eugene, thanks for your response! Your recommendation makes sense; that's more or less what I tried. The problem I am facing is that inside foreachPartition() I cannot create a new RDD and use saveAsTextFile. It would probably make sense to write directly to HDFS using the Java API. When I

Re: Sorted Multiple Outputs

2015-07-15 Thread Eugene Morozov
Yiannis, it looks like you might explore another approach:

sc.textFile("input/path")
  .map() // your own implementation
  .partitionBy(new HashPartitioner(num))
  .groupBy() // your own implementation, as a result - PairRDD of key vs Iterable of values
  .foreachPartition()

On the last step you could so
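The outline above can be made concrete as follows. The parse step (key = first comma-separated field) is a hypothetical stand-in for the "your own implementation" parts, and the per-key write is left as a comment since the thread discusses several ways to do it:

```scala
import org.apache.spark.{HashPartitioner, SparkContext}

def run(sc: SparkContext, num: Int): Unit = {
  sc.textFile("input/path")
    .map(line => (line.split(",")(0), line)) // hypothetical: key = first field
    .partitionBy(new HashPartitioner(num))   // co-locate each key's records in one partition
    .groupByKey()                            // PairRDD of key -> Iterable of values
    .foreachPartition { iter =>
      iter.foreach { case (key, values) =>
        // sort values and write one file per key here,
        // e.g. via the Hadoop FileSystem API, since creating
        // an RDD inside foreachPartition is not possible
      }
    }
}
```

Partitioning before the group means each key's file can be written by a single task without a further shuffle.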

Sorted Multiple Outputs

2015-07-14 Thread Yiannis Gkoufas
Hi there, I have been using the approach described here: http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job In addition to that, I was wondering if there is a way to customize the order of the values contained in each file. Thanks a lot!
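The linked Stack Overflow answer splits output by key with `MultipleTextOutputFormat`. One way to also control the order of values within each file, sketched here, is to sort each key's group before saving; `pairs` and the output path are hypothetical, and string values are assumed so `sorted` has an ordering:

```scala
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  // Route each record to a file named after its key.
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
  // Don't repeat the key inside the file body.
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

pairs                                                         // hypothetical RDD[(String, String)]
  .groupByKey()
  .flatMap { case (k, vs) => vs.toSeq.sorted.map(v => (k, v)) } // order values per key
  .saveAsHadoopFile("/output/path", classOf[String], classOf[String],
    classOf[KeyBasedOutput])
```

The sort happens per group on the executor, so as with the other replies it assumes each key's value list fits in memory.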