Re: spark disk-to-disk

2015-03-24 Thread Koert Kuipers
imran, great, i will take a look at the pullreq. seems we are interested in similar things. On Tue, Mar 24, 2015 at 11:00 AM, Imran Rashid wrote: > I think writing to hdfs and reading it back again is totally reasonable. > In fact, in my experience, writing to hdfs and reading back in actually ...

Re: spark disk-to-disk

2015-03-24 Thread Imran Rashid
I think writing to hdfs and reading it back again is totally reasonable. In fact, in my experience, writing to hdfs and reading back in actually gives you a good opportunity to handle some other issues as well: a) instead of just writing as an object file, I've found it's helpful to write in a form ...
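
The message is cut off before it says which form; as one hedged illustration of writing an intermediate step in something other than an object file, the sketch below round-trips a pair RDD through plain tab-separated text on HDFS. The element type, delimiter, and helper names are assumptions, not what Imran describes.

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Hypothetical alternative to saveAsObjectFile: persist a step as tab-separated
  // text so the intermediate data is readable outside Spark and not tied to Java
  // serialization. The (String, Int) element type and the delimiter are assumptions.
  def writeStep(step: RDD[(String, Int)], path: String): Unit =
    step.map { case (k, v) => s"$k\t$v" }.saveAsTextFile(path)

  def readStep(sc: SparkContext, path: String): RDD[(String, Int)] =
    sc.textFile(path).map { line =>
      val Array(k, v) = line.split('\t')
      (k, v.toInt)
    }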

Re: spark disk-to-disk

2015-03-23 Thread Reynold Xin
Maybe implement a very simple function that uses the Hadoop API to read in based on file names (i.e. parts)? On Mon, Mar 23, 2015 at 10:55 AM, Koert Kuipers wrote: > there is a way to reinstate the partitioner, but that requires > sc.objectFile to read exactly what i wrote, which means sc.objectFile ...
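
A minimal sketch of the file-name-based reading Reynold suggests here, assuming the part-NNNNN layout that saveAsObjectFile produces; the helper name is hypothetical.

  import org.apache.hadoop.fs.Path
  import org.apache.spark.SparkContext

  // List the part-* files under a saved RDD directory via the Hadoop FileSystem API,
  // sorted so the part number lines up with the original partition index.
  // A custom reader could then map each part file back to exactly one partition.
  def listParts(sc: SparkContext, dir: String): Seq[String] = {
    val fs = new Path(dir).getFileSystem(sc.hadoopConfiguration)
    fs.listStatus(new Path(dir))
      .map(_.getPath)
      .filter(_.getName.startsWith("part-"))
      .sortBy(_.getName)
      .map(_.toString)
      .toSeq
  }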

Re: spark disk-to-disk

2015-03-23 Thread Koert Kuipers
there is a way to reinstate the partitioner, but that requires sc.objectFile to read exactly what i wrote, which means sc.objectFile should never split files on reading (a feature of hadoop file inputformat that gets in the way here). On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers wrote: > i just realized the major limitation is that i lose partitioning info ...
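
One way to get the never-split-on-read behavior described here, sketched under the assumption that the data was written by saveAsObjectFile (which stores Java-serialized Array[T] batches in a SequenceFile); the class and helper names are hypothetical.

  import java.io.{ByteArrayInputStream, ObjectInputStream}
  import scala.reflect.ClassTag
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.hadoop.io.{BytesWritable, NullWritable}
  import org.apache.hadoop.mapred.SequenceFileInputFormat
  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // A SequenceFileInputFormat that refuses to split files, so every part file
  // written by saveAsObjectFile comes back as exactly one partition.
  class NonSplittableSequenceFileInputFormat[K, V] extends SequenceFileInputFormat[K, V] {
    override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
  }

  // Read back what saveAsObjectFile wrote, without letting Hadoop split the parts.
  // saveAsObjectFile stores each batch of records as a Java-serialized Array[T].
  def objectFileNoSplit[T: ClassTag](sc: SparkContext, path: String): RDD[T] =
    sc.hadoopFile(
        path,
        classOf[NonSplittableSequenceFileInputFormat[NullWritable, BytesWritable]],
        classOf[NullWritable],
        classOf[BytesWritable])
      .flatMap { case (_, bytes) =>
        val in = new ObjectInputStream(
          new ByteArrayInputStream(bytes.getBytes, 0, bytes.getLength))
        try in.readObject().asInstanceOf[Array[T]] finally in.close()
      }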

Re: spark disk-to-disk

2015-03-23 Thread Koert Kuipers
i just realized the major limitation is that i lose partitioning info... On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin wrote: > > On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers wrote: > >> so finally i can resort to: >> rdd.saveAsObjectFile(...) >> sc.objectFile(...) >> but that seems like a rather broken abstraction ...
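
A small sketch of the limitation being described: the object-file round trip below comes back with no partitioner, and the simple fix (partitionBy) costs a shuffle. Paths and numbers are illustrative.

  import org.apache.spark.{HashPartitioner, SparkContext}
  import org.apache.spark.SparkContext._  // pair-RDD implicits on older Spark versions

  def demo(sc: SparkContext): Unit = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(new HashPartitioner(4))
    println(pairs.partitioner)      // Some(HashPartitioner)

    pairs.saveAsObjectFile("/tmp/step1")    // path is illustrative
    val reloaded = sc.objectFile[(String, Int)]("/tmp/step1")
    println(reloaded.partitioner)   // None: the partitioner is not persisted

    // The easy fix re-shuffles; avoiding that shuffle is what the rest of the
    // thread is about (reading each part file back as its original partition).
    val repartitioned = reloaded.partitionBy(new HashPartitioner(4))
  }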

Re: spark disk-to-disk

2015-03-22 Thread Reynold Xin
On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers wrote: > so finally i can resort to: > rdd.saveAsObjectFile(...) > sc.objectFile(...) > but that seems like a rather broken abstraction. > > This seems like a fine solution to me.

spark disk-to-disk

2015-03-22 Thread Koert Kuipers
i would like to use spark for some algorithms where i make no attempt to work in memory, so read from hdfs and write to hdfs for every step. of course i would like every step to only be evaluated once. and i have no need for spark's RDD lineage info, since i persist to reliable storage. the trouble ...
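
A minimal sketch of the disk-to-disk pattern described here, with each step saved to HDFS and the next step starting from what is read back, so no step is re-evaluated and no lineage is relied upon. The helper name, paths, and use of object files are illustrative assumptions.

  import scala.reflect.ClassTag
  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Run one step: read the previous step's output from HDFS, apply the step's
  // transformation, write the result back to HDFS, and hand downstream work an
  // RDD that reads the saved files rather than re-evaluating the lineage of `out`.
  def runStep[T: ClassTag, U: ClassTag](sc: SparkContext, inPath: String, outPath: String)
      (f: RDD[T] => RDD[U]): RDD[U] = {
    val in  = sc.objectFile[T](inPath)
    val out = f(in)
    out.saveAsObjectFile(outPath)
    sc.objectFile[U](outPath)
  }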