Re: how to convert an rdd to a single output file

2014-12-12 Thread Steve Lewis
What would good spill settings be? On Fri, Dec 12, 2014 at 2:45 PM, Sameer Farooqui wrote: > > You could try re-partitioning or coalescing the RDD to partition and then > write it to disk. Make sure you have good spill settings enabled so that > the RDD can spill to the local temp dirs if it has

Re: how to convert an rdd to a single output file

2014-12-12 Thread Sameer Farooqui
You could try repartitioning or coalescing the RDD to a single partition and then writing it to disk. Make sure you have good spill settings enabled so that the RDD can spill to the local temp dirs if it has to. On Fri, Dec 12, 2014 at 2:39 PM, Steve Lewis wrote: > > The objective is to let the Spark appli
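
A minimal sketch of the coalesce-then-write suggestion above, assuming a PySpark RDD of strings. The helper name `save_as_single_part` and the `output_dir` argument are hypothetical, and the choice of `shuffle=True` is an assumption rather than something stated in the thread.

```python
def save_as_single_part(rdd, output_dir):
    """Collapse an RDD to one partition and write it out as text.

    `rdd` is assumed to be a pyspark.RDD; `output_dir` is a hypothetical
    destination such as an HDFS directory that does not exist yet.
    """
    # shuffle=True puts a shuffle boundary before the final stage, so the
    # earlier (expensive) stages can keep their parallelism and only the
    # write itself runs as a single task.
    single = rdd.coalesce(1, shuffle=True)
    # saveAsTextFile creates a directory; with one partition it should
    # contain a single data file named part-00000.
    single.saveAsTextFile(output_dir)
```

Note that the result is still a directory holding one part file, not a bare file, so a consumer may need to read or rename `part-00000`.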

Re: how to convert an rdd to a single output file

2014-12-12 Thread Steve Lewis
The objective is to let the Spark application generate a file in a format which can be consumed by other programs. As I said, I am willing to give up parallelism at this stage (all the expensive steps were earlier), but I do want an efficient way to pass once through an RDD without the requirement to h

Re: how to convert an rdd to a single output file

2014-12-12 Thread Sameer Farooqui
Instead of doing this on the compute side, I would just write out the file with different blocks initially into HDFS and then use "hadoop fs -getmerge" or HDFSConcat to get one final output file. - SF On Fri, Dec 12, 2014 at 11:19 AM, Steve Lewis wrote: > > > I have an RDD which is potentially
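
A rough sketch of the write-then-merge approach suggested above, assuming the `hadoop` CLI is on the PATH of the machine running the driver. The helper names and paths are hypothetical; only `saveAsTextFile` and `hadoop fs -getmerge` are taken as given.

```python
import subprocess
from typing import List


def getmerge_command(hdfs_dir: str, local_file: str) -> List[str]:
    """Build the `hadoop fs -getmerge <src> <localdst>` command line."""
    return ["hadoop", "fs", "-getmerge", hdfs_dir, local_file]


def save_and_merge(rdd, hdfs_dir: str, local_file: str) -> None:
    """Write the RDD in parallel, then merge the part files into one file.

    `rdd` is assumed to be a pyspark.RDD of strings.
    """
    # Each partition is written as its own part file under hdfs_dir.
    rdd.saveAsTextFile(hdfs_dir)
    # getmerge concatenates the files in hdfs_dir into one file on the
    # local filesystem; check=True raises if the command fails.
    subprocess.run(getmerge_command(hdfs_dir, local_file), check=True)
```

One thing to keep in mind is that `-getmerge` writes its result to the local filesystem, so if the single file has to live in HDFS it would need to be copied back afterwards.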

how to convert an rdd to a single output file

2014-12-12 Thread Steve Lewis
I have an RDD which is potentially too large to store in memory with collect. I want a single task to write the contents as a file to HDFS. Time is not a large issue but memory is. I tried the following, converting my RDD (scans) to a local Iterator. This works, but hasNext shows up as a separate task
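
The local-iterator approach described above might look roughly like the sketch below, written against PySpark's `toLocalIterator()`. The helper names and the local `path` are hypothetical, and writing directly into HDFS would need an HDFS client that is not shown here.

```python
import io
from typing import Iterable, TextIO


def write_records(records: Iterable[str], out: TextIO) -> int:
    """Write one record per line, consuming the iterator lazily."""
    count = 0
    for record in records:
        out.write(record)
        out.write("\n")
        count += 1
    return count


def write_rdd_to_single_file(rdd, path: str) -> int:
    """Stream an RDD to one local file without calling collect().

    `rdd` is assumed to be a pyspark.RDD of strings. toLocalIterator()
    brings partitions to the driver one at a time, so the driver needs
    room for the largest partition rather than for the whole RDD.
    """
    with open(path, "w", encoding="utf-8") as out:
        return write_records(rdd.toLocalIterator(), out)


# Stand-in demo without Spark: any iterator of strings works.
buffer = io.StringIO()
written = write_records(iter(["scan-1", "scan-2", "scan-3"]), buffer)
```

Because the iterator fetches partitions on demand, each fetch is likely to appear as its own small job in the Spark UI.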