Hi Eugene,
in my case the list of values that I want to sort and write to a separate
file is fairly small, so I solved it as follows:
.groupByKey().foreach(e => {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  val newPath = rootPath + "/" + e._1
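A rough sketch of how the rest of that block could look (untested; rdd stands in for the upstream pair RDD, and it assumes String values so each small group can be sorted in memory with the default ordering):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

rdd.groupByKey().foreach { case (key, values) =>
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  val newPath = new Path(rootPath + "/" + key)
  // each group is small, so sorting in memory is cheap;
  // swap in any Ordering you need here
  val sorted = values.toSeq.sorted
  val out = hdfs.create(newPath)
  try sorted.foreach(v => out.writeBytes(v + "\n"))
  finally out.close()
}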
Yiannis,
sorry for the late response.
It is indeed not possible to create a new RDD inside foreachPartition(), so you
have to write the data manually. I haven't tried that and haven't run into such
an exception, but I'd assume you could write locally and then upload the file
into HDFS. FileSystem has a s
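For illustration, the write-locally-then-upload idea could look like this inside foreachPartition (a sketch; grouped, outputDir, and the String values are assumptions, with copyFromLocalFile doing the upload):

import java.io.{File, PrintWriter}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

grouped.foreachPartition { iter =>
  val hdfs = FileSystem.get(new Configuration())
  iter.foreach { case (key, values) =>
    // write the group to a local temp file on the executor
    val local = File.createTempFile("group-" + key, ".txt")
    val writer = new PrintWriter(local)
    try values.foreach(v => writer.println(v)) finally writer.close()
    // then copy it into HDFS (outputDir is a hypothetical path)
    hdfs.copyFromLocalFile(new Path(local.getAbsolutePath), new Path(outputDir + "/" + key))
    local.delete()
  }
}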
Hi Eugene,
thanks for your response!
Your recommendation makes sense; that's more or less what I tried.
The problem I am facing is that inside foreachPartition() I cannot
create a new RDD and call saveAsTextFile on it.
It would probably make sense to write directly to HDFS using the Java API.
When I
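For reference, writing directly to HDFS with the FileSystem API from inside foreachPartition could look roughly like this (a sketch; grouped and outputDir are stand-ins for the actual RDD and path):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

grouped.foreachPartition { iter =>
  // one FileSystem handle per partition, opened on the executor
  val hdfs = FileSystem.get(new Configuration())
  iter.foreach { case (key, values) =>
    val out = hdfs.create(new Path(outputDir + "/" + key))
    try values.foreach(v => out.writeBytes(v + "\n"))
    finally out.close()
  }
  // no RDD or saveAsTextFile is needed inside the partition
}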
Yiannis,
It looks like you could explore another approach:
sc.textFile("input/path")
  .map() // your own implementation
  .partitionBy(new HashPartitioner(num))
  .groupBy() // your own implementation; the result is a PairRDD of key vs Iterable of values
  .foreachPartition()
On the last step you could so
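Filled in, the pipeline might look like this (a sketch; the parsing in map, the number of partitions, and groupByKey in place of a custom groupBy are assumptions):

import org.apache.spark.HashPartitioner

val num = 16 // number of partitions, a placeholder
sc.textFile("input/path")
  .map(line => (line.split(",")(0), line)) // placeholder: key each record
  .partitionBy(new HashPartitioner(num))
  .groupByKey() // PairRDD of key vs Iterable of values
  .foreachPartition { iter =>
    iter.foreach { case (key, values) =>
      // last step: write each group to its own file, e.g. via FileSystem.create
    }
  }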
Hi there,
I have been using the approach described here:
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
In addition to that, I was wondering if there is a way to customize the
order of the values contained in each file.
Thanks a lot!