Re: spark disk-to-disk

2015-03-24 Thread Koert Kuipers
imran,
great, i will take a look at the pull request. seems we are interested in
similar things.


On Tue, Mar 24, 2015 at 11:00 AM, Imran Rashid  wrote:

> I think writing to hdfs and reading it back again is totally reasonable.
> In fact, in my experience, writing to hdfs and reading back in actually
> gives you a good opportunity to handle some other issues as well:
>
> a) instead of just writing as an object file, I've found its helpful to
> write in a format that is a little more readable.  Json if efficiency
> doesn't matter :) or you could use something like avro, which at least has
> a good set of command line tools.
>
> b) when developing, I hate it when I introduce a bug in step 12 of a long
> pipeline, and need to re-run the whole thing.  If you save to disk, you can
> write a little application logic that realizes step 11 is already sitting
> on disk, and just restart from there.
>
> c) writing to disk is also a good opportunity to do a little crude
> "auto-tuning" of the number of partitions.  You can look at the size of
> each partition on hdfs, and then adjust the number of partitions.
>
> And I completely agree that losing the partitioning info is a major
> limitation -- I submitted a PR to help deal w/ it:
>
> https://github.com/apache/spark/pull/4449
>
> getting narrow dependencies w/ partitioners can lead to pretty big
> performance improvements, so I do think its important to make it easily
> accessible to the user.  Though now I'm thinking that maybe this api is a
> little clunky, and this should get rolled into the other changes you are
> proposing to hadoop RDD & friends -- but I'll go into more discussion on
> that thread.
>
>
>
> On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers  wrote:
>
>> there is a way to reinstate the partitioner, but that requires
>> sc.objectFile to read exactly what i wrote, which means sc.objectFile
>> should never split files on reading (a feature of hadoop file inputformat
>> that gets in the way here).
>>
>> On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers  wrote:
>>
>>> i just realized the major limitation is that i lose partitioning info...
>>>
>>> On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin 
>>> wrote:
>>>

 On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers 
 wrote:

> so finally i can resort to:
> rdd.saveAsObjectFile(...)
> sc.objectFile(...)
> but that seems like a rather broken abstraction.
>
>
 This seems like a fine solution to me.


>>>
>>
>


Re: spark disk-to-disk

2015-03-24 Thread Imran Rashid
I think writing to hdfs and reading it back again is totally reasonable.
In fact, in my experience, writing to hdfs and reading back in actually
gives you a good opportunity to handle some other issues as well:

a) instead of just writing as an object file, I've found it's helpful to
write in a format that is a little more readable.  JSON if efficiency
doesn't matter :) or you could use something like Avro, which at least has
a good set of command-line tools.

b) when developing, I hate it when I introduce a bug in step 12 of a long
pipeline and need to re-run the whole thing.  If you save to disk, you can
write a little application logic that notices step 11 is already sitting
on disk and just restarts from there.

c) writing to disk is also a good opportunity to do a little crude
"auto-tuning" of the number of partitions.  You can look at the size of
each partition on hdfs and then adjust the partition count for the next step.
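
To make (a) through (c) concrete, here is a rough sketch (assuming an existing
SparkContext; the paths, the (String, Int) element type and the 128 MB target
are illustrative, not anything from the thread):

import scala.reflect.ClassTag
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object DiskToDisk {
  // (a) a human-readable dump: hand-rolled JSON lines, fine when efficiency doesn't matter
  def saveAsJsonLines(rdd: RDD[(String, Int)], path: String): Unit =
    rdd.map { case (k, v) => s"""{"key":"$k","value":$v}""" }.saveAsTextFile(path)

  // (b) skip a step whose output is already sitting on hdfs; otherwise compute it,
  // write it out, and hand downstream steps the materialized copy
  def computeOrLoad[T: ClassTag](sc: SparkContext, path: String)(compute: => RDD[T]): RDD[T] = {
    val fs = FileSystem.get(sc.hadoopConfiguration)
    if (!fs.exists(new Path(path))) compute.saveAsObjectFile(path)
    sc.objectFile[T](path)
  }

  // (c) crude auto-tuning: derive the next step's partition count from the
  // total size of the previous step's output on hdfs
  def suggestPartitions(sc: SparkContext, path: String, targetBytes: Long = 128L << 20): Int = {
    val totalBytes = FileSystem.get(sc.hadoopConfiguration).getContentSummary(new Path(path)).getLength
    math.max(1, (totalBytes / targetBytes).toInt)
  }
}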

And I completely agree that losing the partitioning info is a major
limitation -- I submitted a PR to help deal w/ it:

https://github.com/apache/spark/pull/4449

getting narrow dependencies w/ partitioners can lead to pretty big
performance improvements, so I do think it's important to make this easily
accessible to the user.  Though now I'm thinking that maybe this API is a
little clunky, and this should get rolled into the other changes you are
proposing to HadoopRDD & friends -- but I'll go into more discussion on
that thread.
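
Not from the thread, just a small illustration of why the partitioner matters
(assuming an existing SparkContext sc; the path is hypothetical): when both
sides of a join share a partitioner the join is a narrow dependency, while an
RDD reloaded via sc.objectFile comes back with partitioner == None and has to
be shuffled again.

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._

val left  = sc.parallelize(Seq(("a", 1), ("b", 2))).partitionBy(new HashPartitioner(8))
val right = sc.parallelize(Seq(("a", 9), ("b", 8))).partitionBy(new HashPartitioner(8))
left.join(right)    // co-partitioned: narrow dependencies, no extra shuffle

val reloaded = sc.objectFile[(String, Int)]("hdfs:///tmp/step11")  // hypothetical path
reloaded.partitioner    // None -- the partitioning info did not survive the round trip
left.join(reloaded)     // the reloaded side has to be shuffled again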



Re: spark disk-to-disk

2015-03-23 Thread Reynold Xin
Maybe implement a very simple function that uses the Hadoop API to read the
data back in based on file names (i.e. the individual part files)?
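
Presumably something along these lines (my guess at the suggestion, not code
from the thread; assuming an existing SparkContext): list the part files with
the Hadoop FileSystem API, read each one as its own RDD, and union them in
part order so partition i of the result corresponds to part file i on disk.
A large part file may still be split by the underlying input format, which is
the caveat Koert raises below.

import scala.reflect.ClassTag
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

def readByPart[T: ClassTag](sc: SparkContext, dir: String): RDD[T] = {
  val fs = FileSystem.get(sc.hadoopConfiguration)
  val parts = fs.listStatus(new Path(dir))
    .map(_.getPath)
    .filter(_.getName.startsWith("part-"))
    .sortBy(_.getName)
  // one RDD per part file (minPartitions = 1), unioned in part-number order
  sc.union(parts.map(p => sc.objectFile[T](p.toString, 1)).toSeq)
}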


Re: spark disk-to-disk

2015-03-23 Thread Koert Kuipers
there is a way to reinstate the partitioner, but that requires
sc.objectFile to read back exactly what i wrote, which means sc.objectFile
should never split files on reading (a feature of the hadoop FileInputFormat
that gets in the way here).
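
My own sketch of one way around the splitting (an assumption, not something
proposed in the thread; assuming an existing SparkContext sc): saveAsObjectFile
writes a SequenceFile of (NullWritable, BytesWritable), so an input format that
refuses to split would yield exactly one partition per part file, after which
the original partitioner could be re-attached to the reloaded RDD.

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{BytesWritable, NullWritable}
import org.apache.hadoop.mapred.SequenceFileInputFormat

// a SequenceFileInputFormat that never splits, so one part file == one partition
class NonSplittableSequenceFileInputFormat[K, V] extends SequenceFileInputFormat[K, V] {
  override protected def isSplitable(fs: FileSystem, file: Path): Boolean = false
}

// hypothetical usage: the raw values still have to be deserialized the same
// way saveAsObjectFile serialized them
val raw = sc.hadoopFile(
  "hdfs:///tmp/step11",  // hypothetical path
  classOf[NonSplittableSequenceFileInputFormat[NullWritable, BytesWritable]],
  classOf[NullWritable],
  classOf[BytesWritable])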


Re: spark disk-to-disk

2015-03-23 Thread Koert Kuipers
i just realized the major limitation is that i lose partitioning info...


Re: spark disk-to-disk

2015-03-22 Thread Reynold Xin
On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers  wrote:

> so finally i can resort to:
> rdd.saveAsObjectFile(...)
> sc.objectFile(...)
> but that seems like a rather broken abstraction.
>
>
This seems like a fine solution to me.


spark disk-to-disk

2015-03-22 Thread Koert Kuipers
i would like to use spark for some algorithms where i make no attempt to
work in memory, so i read from hdfs and write to hdfs for every step.
of course i would like every step to only be evaluated once. and i have no
need for spark's RDD lineage info, since i persist to reliable storage.

the trouble is, i am not sure how to proceed.

rdd.checkpoint() seems like the obvious candidate to force my computations
to write intermediate data to hdfs and cut the lineage, but
rdd.checkpoint() does not actually trigger a job by itself. the checkpoint
only runs after some other action has triggered a job, which leads to
recomputation.
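
to make the behavior concrete, a minimal sketch (assuming an existing
SparkContext sc; the checkpoint directory, input and expensiveStep are
hypothetical names):

sc.setCheckpointDir("hdfs:///tmp/checkpoints")
val step = expensiveStep(input)   // some costly transformation chain
step.checkpoint()                 // only marks the RDD; nothing is written yet
step.count()                      // job 1 computes the lineage for the action, then a
                                  // second job recomputes it to write the checkpoint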

the suggestion in the docs is to do:
rdd.cache(); rdd.checkpoint()
but that won't work for me since the data does not fit in memory.

instead i could do:
rdd.persist(StorageLevel.DISK_ONLY_2); rdd.checkpoint()
but that leads to the data being written to disk twice in a row, which
seems wasteful.

so finally i can resort to:
rdd.saveAsObjectFile(...)
sc.objectFile(...)
but that seems like a rather broken abstraction.
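
spelled out, the workaround looks like this (the path, stepOutput and the
element type MyRecord are hypothetical): reading the files back yields a fresh
RDD whose lineage starts at the hdfs files, so earlier steps are never
recomputed, but it goes through object files and loses the partitioner instead
of being a first-class "materialize to hdfs" step.

stepOutput.saveAsObjectFile("hdfs:///pipeline/step3")
val step3 = sc.objectFile[MyRecord]("hdfs:///pipeline/step3")   // lineage is cut here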

any ideas? i feel like i am missing something obvious. or am i running yet
again into spark's historical in-memory bias?