Re: spark disk-to-disk
imran, great, i will take a look at the pullreq. seems we are interested in similar things

On Tue, Mar 24, 2015 at 11:00 AM, Imran Rashid wrote:
> I think writing to hdfs and reading it back again is totally reasonable.
> In fact, in my experience, writing to hdfs and reading back in actually
> gives you a good opportunity to handle some other issues as well:
>
> a) instead of just writing as an object file, I've found it's helpful to
> write in a format that is a little more readable. JSON if efficiency
> doesn't matter :) or you could use something like avro, which at least has
> a good set of command line tools.
>
> b) when developing, I hate it when I introduce a bug in step 12 of a long
> pipeline, and need to re-run the whole thing. If you save to disk, you can
> write a little application logic that realizes step 11 is already sitting
> on disk, and just restart from there.
>
> c) writing to disk is also a good opportunity to do a little crude
> "auto-tuning" of the number of partitions. You can look at the size of
> each partition on hdfs, and then adjust the number of partitions.
>
> And I completely agree that losing the partitioning info is a major
> limitation -- I submitted a PR to help deal w/ it:
>
> https://github.com/apache/spark/pull/4449
>
> getting narrow dependencies w/ partitioners can lead to pretty big
> performance improvements, so I do think it's important to make it easily
> accessible to the user. Though now I'm thinking that maybe this api is a
> little clunky, and this should get rolled into the other changes you are
> proposing to hadoop RDD & friends -- but I'll go into more discussion on
> that thread.
>
> On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers wrote:
>> there is a way to reinstate the partitioner, but that requires
>> sc.objectFile to read exactly what i wrote, which means sc.objectFile
>> should never split files on reading (a feature of hadoop file inputformat
>> that gets in the way here).
>>
>> On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers wrote:
>>> i just realized the major limitation is that i lose partitioning info...
>>>
>>> On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin wrote:
>>>> On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers wrote:
>>>>> so finally i can resort to:
>>>>> rdd.saveAsObjectFile(...)
>>>>> sc.objectFile(...)
>>>>> but that seems like a rather broken abstraction.
>>>>
>>>> This seems like a fine solution to me.
Re: spark disk-to-disk
I think writing to hdfs and reading it back again is totally reasonable. In fact, in my experience, writing to hdfs and reading back in actually gives you a good opportunity to handle some other issues as well:

a) instead of just writing as an object file, I've found it's helpful to write in a format that is a little more readable. JSON if efficiency doesn't matter :) or you could use something like avro, which at least has a good set of command line tools.

b) when developing, I hate it when I introduce a bug in step 12 of a long pipeline, and need to re-run the whole thing. If you save to disk, you can write a little application logic that realizes step 11 is already sitting on disk, and just restart from there.

c) writing to disk is also a good opportunity to do a little crude "auto-tuning" of the number of partitions. You can look at the size of each partition on hdfs, and then adjust the number of partitions.

And I completely agree that losing the partitioning info is a major limitation -- I submitted a PR to help deal w/ it:

https://github.com/apache/spark/pull/4449

getting narrow dependencies w/ partitioners can lead to pretty big performance improvements, so I do think it's important to make it easily accessible to the user. Though now I'm thinking that maybe this api is a little clunky, and this should get rolled into the other changes you are proposing to hadoop RDD & friends -- but I'll go into more discussion on that thread.

On Mon, Mar 23, 2015 at 12:55 PM, Koert Kuipers wrote:
> there is a way to reinstate the partitioner, but that requires
> sc.objectFile to read exactly what i wrote, which means sc.objectFile
> should never split files on reading (a feature of hadoop file inputformat
> that gets in the way here).
>
> On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers wrote:
>> i just realized the major limitation is that i lose partitioning info...
>>
>> On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin wrote:
>>> On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers wrote:
>>>> so finally i can resort to:
>>>> rdd.saveAsObjectFile(...)
>>>> sc.objectFile(...)
>>>> but that seems like a rather broken abstraction.
>>>
>>> This seems like a fine solution to me.
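[Point (c) above can be sketched as a small helper: given the observed on-disk sizes of the partitions from a previous run, suggest a partition count for the next run that targets a desired per-partition size. A minimal sketch in Scala; the object, method, and parameter names are made up for illustration and are not part of Spark.]

```scala
object PartitionTuning {
  /** Given the on-disk sizes (in bytes) of each partition from a previous
    * run, suggest how many partitions the next run should use so that each
    * partition lands near `targetBytes`. Always returns at least 1. */
  def suggestNumPartitions(partitionSizes: Seq[Long], targetBytes: Long): Int = {
    require(targetBytes > 0, "targetBytes must be positive")
    val totalBytes = partitionSizes.sum
    math.max(1, math.ceil(totalBytes.toDouble / targetBytes).toInt)
  }
}
```

[In practice the sizes would come from listing the step's output directory on hdfs, e.g. via the Hadoop FileSystem API, and the suggested count would be fed into a repartition before writing the next step.]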
Re: spark disk-to-disk
Maybe implement a very simple function that uses the Hadoop API to read in based on file names (i.e. parts)?

On Mon, Mar 23, 2015 at 10:55 AM, Koert Kuipers wrote:
> there is a way to reinstate the partitioner, but that requires
> sc.objectFile to read exactly what i wrote, which means sc.objectFile
> should never split files on reading (a feature of hadoop file inputformat
> that gets in the way here).
>
> On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers wrote:
>> i just realized the major limitation is that i lose partitioning info...
>>
>> On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin wrote:
>>> On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers wrote:
>>>> so finally i can resort to:
>>>> rdd.saveAsObjectFile(...)
>>>> sc.objectFile(...)
>>>> but that seems like a rather broken abstraction.
>>>
>>> This seems like a fine solution to me.
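[The suggestion above relies on each saved partition ending up in a predictably named part file (part-00000, part-00001, ...). A minimal sketch of the filename-to-partition mapping such a reader would need; the object and method names here are hypothetical, not an existing API.]

```scala
object PartFiles {
  // Hadoop-style part file names: "part-00003", "part-00003.avro", etc.
  private val PartPattern = """part-(\d+).*""".r

  /** Extract the partition index encoded in a part file name, ignoring
    * non-part files such as the _SUCCESS marker. */
  def partitionIndex(fileName: String): Option[Int] = fileName match {
    case PartPattern(idx) => Some(idx.toInt)
    case _                => None
  }
}
```

[A custom reader could list the output directory, read each part file into the partition with the matching index, and then reattach the original Partitioner: since partition i is read back exactly as partition i was written, the partitioning really does still hold.]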
Re: spark disk-to-disk
there is a way to reinstate the partitioner, but that requires sc.objectFile to read exactly what i wrote, which means sc.objectFile should never split files on reading (a feature of hadoop file inputformat that gets in the way here).

On Mon, Mar 23, 2015 at 1:39 PM, Koert Kuipers wrote:
> i just realized the major limitation is that i lose partitioning info...
>
> On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin wrote:
>> On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers wrote:
>>> so finally i can resort to:
>>> rdd.saveAsObjectFile(...)
>>> sc.objectFile(...)
>>> but that seems like a rather broken abstraction.
>>
>> This seems like a fine solution to me.
Re: spark disk-to-disk
i just realized the major limitation is that i lose partitioning info...

On Mon, Mar 23, 2015 at 1:34 AM, Reynold Xin wrote:
> On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers wrote:
>> so finally i can resort to:
>> rdd.saveAsObjectFile(...)
>> sc.objectFile(...)
>> but that seems like a rather broken abstraction.
>
> This seems like a fine solution to me.
Re: spark disk-to-disk
On Sun, Mar 22, 2015 at 6:03 PM, Koert Kuipers wrote:
> so finally i can resort to:
> rdd.saveAsObjectFile(...)
> sc.objectFile(...)
> but that seems like a rather broken abstraction.

This seems like a fine solution to me.
spark disk-to-disk
i would like to use spark for some algorithms where i make no attempt to work in memory, so read from hdfs and write to hdfs for every step. of course i would like every step to only be evaluated once. and i have no need for spark's RDD lineage info, since i persist to reliable storage.

the trouble is, i am not sure how to proceed. rdd.checkpoint() seems like the obvious candidate to force my computations to write intermediate data to hdfs and cut the lineage, but rdd.checkpoint() does not actually trigger a job: it only runs after some other action has already triggered a job, leading to recomputation.

the suggestion in the docs is to do:
rdd.cache(); rdd.checkpoint()
but that won't work for me since the data does not fit in memory.

instead i could do:
rdd.persist(StorageLevel.DISK_ONLY_2); rdd.checkpoint()
but that leads to the data being written to disk twice in a row, which seems wasteful.

so finally i can resort to:
rdd.saveAsObjectFile(...)
sc.objectFile(...)
but that seems like a rather broken abstraction.

any ideas? i feel like i am missing something obvious. or am i running yet again into spark's historical in-memory bias?
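[For the "evaluate every step exactly once" requirement, one way to sidestep checkpoint's recomputation is to treat each step's output directory as the unit of work: if the directory already holds a _SUCCESS marker (which the hadoop output committers write on successful completion), skip the step; otherwise run it. A minimal sketch of that driver logic, using the local filesystem for illustration (on hdfs one would check via org.apache.hadoop.fs.FileSystem); runStep and its contract are hypothetical, not a Spark API.]

```scala
import java.nio.file.{Files, Paths}

/** Run `compute` (which should write its output under `outDir`, e.g. via
  * rdd.saveAsObjectFile) only if the step has not already completed; a
  * _SUCCESS marker records completion, so a re-run of the pipeline resumes
  * from the first step whose output is missing. */
def runStep(outDir: String)(compute: String => Unit): String = {
  val success = Paths.get(outDir, "_SUCCESS")
  if (!Files.exists(success)) {
    compute(outDir)
    Files.createDirectories(success.getParent) // no-op if compute made it
    Files.createFile(success)                  // mark the step done
  }
  outDir
}
```

[Note that spark's hadoop-based save actions already write _SUCCESS themselves, so against hdfs only the existence check would be needed; the explicit marker here just keeps the sketch self-contained.]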