You are running on a local file system, right? HDFS lists files in order by name, but local file systems often don't. I think that explains the difference.
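To illustrate the point, here is a rough sketch in plain Python (not Spark code) of listing the part files in name order instead of relying on whatever order the file system returns. The helper names `sorted_part_files` and `first_keys` and the `part-` prefix filter are made up for this example. It also assumes the part numbers are zero-padded, so a plain string sort matches numeric order.

```python
import os


def sorted_part_files(output_dir, prefix="part-"):
    """Return the paths of the part files in output_dir, ordered by file name.

    Hypothetical helper for illustration only. os.listdir() returns entries
    in arbitrary order, so the names are sorted explicitly.
    """
    names = [n for n in os.listdir(output_dir) if n.startswith(prefix)]
    return [os.path.join(output_dir, n) for n in sorted(names)]


def first_keys(output_dir, key_len=16):
    """Read the first key_len bytes of each part file, in file-name order."""
    keys = []
    for path in sorted_part_files(output_dir):
        with open(path, "rb") as f:
            keys.append(f.read(key_len))
    return keys
```

If the output was written sorted, the keys returned by `first_keys` for the output directory should come back in ascending order. Files that do not start with `part-` (such as `_SUCCESS` or hidden checksum files) are skipped by the prefix filter.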
We might be able to sort the partitions when we create an RDD to make the ordering consistent across file systems, though.

On Fri, Jan 16, 2015 at 8:26 AM, Ewan Higgs <ewan.hi...@ugent.be> wrote:

> Hi all,
> Quick one: when reading files, are the orders of partitions guaranteed to
> be preserved? I am finding some weird behaviour where I run sortByKeys() on
> an RDD (which has 16 byte keys) and write it to disk. If I open a python
> shell and run the following:
>
> for part in range(29):
>     print map(ord,
>         open('/home/ehiggs/data/terasort_out/part-r-000{0:02}'.format(part),
>         'r').read(16))
>
> Then each partition is in order based on the first value of each partition.
>
> I can also call TeraValidate.validate from TeraSort and it is happy with
> the results. It seems to be on loading the file that the reordering
> happens. If this is expected, is there a way to ask Spark nicely to give me
> the RDD in the order it was saved?
>
> This is based on trying to fix my TeraValidate code on this branch:
> https://github.com/ehiggs/spark/tree/terasort
>
> Thanks,
> Ewan