imran,
great, i will take a look at the pullreq. seems we are interested in
similar things
On Tue, Mar 24, 2015 at 11:00 AM, Imran Rashid wrote:
I think writing to hdfs and reading it back again is totally reasonable.
In fact, in my experience, writing to hdfs and reading back in actually
gives you a good opportunity to handle some other issues as well:
a) instead of just writing as an object file, I've found it's helpful to
write in a form
Maybe implement a very simple function that uses the Hadoop API to read in
based on file names (i.e. parts)?
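A minimal name-handling sketch of that idea (the helper name `partFilesInOrder` is mine, and I'm assuming the standard `part-NNNNN` naming that saveAsObjectFile produces — the actual reads would then go through the Hadoop FileSystem API or sc.objectFile, one file per partition):

```scala
// Hypothetical helper: given the file names found in a saveAsObjectFile
// output directory, keep only the part files and return them in partition
// order, so a reader can map part file i back to partition i.
def partFilesInOrder(names: Seq[String]): Seq[String] =
  names
    .filter(_.matches("part-\\d+"))          // drop _SUCCESS, _temporary, etc.
    .sortBy(_.stripPrefix("part-").toInt)    // numeric, not lexicographic, order

// e.g. Seq("_SUCCESS", "part-00001", "part-00000")
//      -> Seq("part-00000", "part-00001")
```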
On Mon, Mar 23, 2015 at 10:55 AM, Koert Kuipers wrote:
there is a way to reinstate the partitioner, but that requires
sc.objectFile to read back exactly what i wrote, which means sc.objectFile
should never split files on reading (the input-split behavior of hadoop's
FileInputFormat gets in the way here).
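To make that failure mode concrete, here is a pure-Scala illustration (no Spark on the classpath; the file layout and the splitting reader are simulated, but the assignment rule is the same modulo rule Spark's HashPartitioner uses):

```scala
// Same assignment rule as Spark's HashPartitioner: key.hashCode modulo the
// partition count, forced non-negative.
def hashPartition(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

val numPartitions = 4
val keys = Seq("a", "b", "c", "d", "e", "f")

// Writing: partition i's keys end up in file part-0000i.
val files: Map[Int, Seq[String]] =
  keys.groupBy(k => hashPartition(k, numPartitions))

// Faithful read-back (one file -> one partition, same order): every key is
// still where the partitioner says, so it is safe to reinstate it.
val faithful = files.forall { case (i, ks) =>
  ks.forall(k => hashPartition(k, numPartitions) == i)
}

// A splitting reader that breaks file i into partitions 2i and 2i+1: no
// HashPartitioner over the new partition count describes this layout.
val split: Map[Int, Seq[String]] = files.flatMap { case (i, ks) =>
  val (front, back) = ks.splitAt((ks.size + 1) / 2)
  Seq(2 * i -> front, 2 * i + 1 -> back)
}
val stillHolds = split.forall { case (i, ks) =>
  ks.forall(k => hashPartition(k, 2 * numPartitions) == i)
}
// faithful is true; stillHolds is false
```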
On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers wrote:
i just realized the major limitation is that i lose partitioning info...
On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin wrote:
On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers wrote:
> so finally i can resort to:
> rdd.saveAsObjectFile(...)
> sc.objectFile(...)
> but that seems like a rather broken abstraction.
This seems like a fine solution to me.
i would like to use spark for some algorithms where i make no attempt to
work in memory, so read from hdfs and write to hdfs for every step.
of course i would like every step to only be evaluated once. and i have no
need for spark's RDD lineage info, since i persist to reliable storage.
the troubl
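The setup described above — persist every step to reliable storage, evaluate each step exactly once, no reliance on RDD lineage — can be sketched without a cluster by abstracting the storage calls. With Spark, `save` would be rdd.saveAsObjectFile and `load` would be sc.objectFile; the in-memory store and the wrapper name `materialize` are my own framing, not from the thread:

```scala
import scala.collection.mutable

// Stand-in for HDFS: once a step's output is written under a path, later
// reads come from storage and the step is never recomputed — the
// "no lineage needed" property the message describes.
val store = mutable.Map.empty[String, Seq[Int]]

var evaluations = 0
def expensiveStep(xs: Seq[Int]): Seq[Int] = { evaluations += 1; xs.map(_ * 2) }

// `compute` is by-name, so it only runs when the path is absent from storage.
def materialize(path: String, compute: => Seq[Int]): Seq[Int] =
  store.getOrElseUpdate(path, compute)

val step1 = materialize("/steps/1", expensiveStep(Seq(1, 2, 3)))
val rerun = materialize("/steps/1", expensiveStep(Seq(1, 2, 3)))
// evaluations == 1: the step ran once; the second call read from "storage"
```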