I think writing to HDFS and reading it back again is totally reasonable.
In fact, in my experience, the round trip gives you a good opportunity to
handle some other issues as well:

a) instead of just writing an object file, I've found it's helpful to
write in a format that is a little more readable.  JSON if efficiency
doesn't matter :) or you could use something like Avro, which at least has
a good set of command-line tools.
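For instance, here's a minimal sketch of dumping a pair RDD as JSON lines
instead of an opaque object file.  The path and the hand-rolled JSON
formatting are just for illustration -- a real job would use a proper JSON
library and escape values:

import org.apache.spark.{SparkConf, SparkContext}

object JsonSave {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("json-save"))
    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))
    // One JSON object per line, so the output is readable with
    // `hdfs dfs -cat`, grep, jq, etc.
    rdd.map { case (k, v) => s"""{"key":"$k","value":$v}""" }
      .saveAsTextFile("hdfs:///tmp/step11-json")  // hypothetical path
    sc.stop()
  }
}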

b) when developing, I hate it when I introduce a bug in step 12 of a long
pipeline and need to re-run the whole thing.  If you save to disk, you can
write a little application logic that notices the output of step 11 is
already sitting on disk, and just restart from there -- something like the
sketch after this paragraph.
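A rough sketch of that idea, assuming text output.  The helper name and the
_SUCCESS-marker check are my own convention, not anything Spark provides:

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Re-run a step only if its output isn't already on disk.  Checking for
// the _SUCCESS marker (written by the Hadoop output committer) avoids
// picking up a partially written directory.
def stepOrReload(sc: SparkContext, path: String)(compute: => RDD[String]): RDD[String] = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  if (fs.exists(new Path(path, "_SUCCESS"))) {
    sc.textFile(path)     // step already ran: just reload its output
  } else {
    val rdd = compute     // otherwise compute it and persist for next time
    rdd.saveAsTextFile(path)
    rdd
  }
}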

c) writing to disk is also a good opportunity to do a little crude
"auto-tuning" of the number of partitions.  You can look at the size of the
data on HDFS, and then adjust the number of partitions when you read it
back in (sketch below).
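Something like this, say -- the ~128 MB target per partition is my
assumption (it matches the common HDFS block size), not a Spark default,
and sc is an existing SparkContext as elsewhere in this thread:

import org.apache.hadoop.fs.{FileSystem, Path}

// Crude auto-tuning: measure the bytes just written, then pick a
// partition count for reading it back, aiming at ~128 MB per partition.
val out = new Path("hdfs:///tmp/step11-json")  // hypothetical path
val fs = FileSystem.get(sc.hadoopConfiguration)
val bytes = fs.getContentSummary(out).getLength
val target = 128L * 1024 * 1024
val numPartitions = math.max(1, math.ceil(bytes.toDouble / target).toInt)
val reloaded = sc.textFile(out.toString, numPartitions)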

And I completely agree that losing the partitioning info is a major
limitation -- I submitted a PR to help deal w/ it:

https://github.com/apache/spark/pull/4449

getting narrow dependencies w/ partitioners can lead to pretty big
performance improvements, so I do think it's important to make it easily
accessible to the user.  Though now I'm thinking that maybe this API is a
little clunky, and this should get rolled into the other changes you are
proposing to Hadoop RDD & friends -- but I'll go into more discussion on
that thread.
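(For anyone following along, a quick illustration of why the partitioner
matters -- this is plain Spark, not the PR's API: when both sides of a
join share the same partitioner, the join becomes a narrow dependency and
no shuffle is needed.  Data and partition count here are made up, and sc
is an existing SparkContext.)

import org.apache.spark.HashPartitioner

// Both RDDs are hash-partitioned the same way, so join() is a narrow
// dependency: each output partition reads exactly one partition from
// each parent, with no shuffle.
val part = new HashPartitioner(8)
val left  = sc.parallelize(Seq(1 -> "x", 2 -> "y")).partitionBy(part).cache()
val right = sc.parallelize(Seq(1 -> "u", 2 -> "v")).partitionBy(part).cache()
val joined = left.join(right)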



On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers <ko...@tresata.com> wrote:

> there is a way to reinstate the partitioner, but that requires
> sc.objectFile to read exactly what i wrote, which means sc.objectFile
> should never split files on reading (a feature of hadoop file inputformat
> that gets in the way here).
>
> On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> i just realized the major limitation is that i lose partitioning info...
>>
>> On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin <r...@databricks.com> wrote:
>>
>>>
>>> On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> so finally i can resort to:
>>>> rdd.saveAsObjectFile(...)
>>>> sc.objectFile(...)
>>>> but that seems like a rather broken abstraction.
>>>>
>>>>
>>> This seems like a fine solution to me.
>>>
>>>
>>
>
