You are running on a local file system, right? HDFS lists files in order of
their names, but local file systems often don't. I think that explains the
difference.

We might be able to sort the files by name and order the partitions to match
when we create an RDD, which would make this behaviour universal.
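
As a rough sketch of that idea (plain Python rather than Spark's actual
code; the helper name sorted_part_files is made up for illustration, and it
assumes the part files carry zero-padded numbers so that name order matches
partition order):

```python
import os


def sorted_part_files(directory):
    """Return the paths of the part files in `directory`, ordered by name.

    os.listdir() returns entries in arbitrary order, so the explicit sort
    is what makes the result independent of the underlying file system.
    """
    # Keep only output partitions; skip markers such as _SUCCESS.
    names = [name for name in os.listdir(directory) if name.startswith("part-")]
    # Lexicographic order equals numeric order for zero-padded names.
    return [os.path.join(directory, name) for name in sorted(names)]
```

Reading the files in the order this returns would give the partitions back
in the order they were written, on a local file system as well as on HDFS.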

On Fri, Jan 16, 2015 at 8:26 AM, Ewan Higgs <ewan.hi...@ugent.be> wrote:

> Hi all,
> Quick one: when reading files, is the order of partitions guaranteed to
> be preserved? I am finding some weird behaviour where I run sortByKey() on
> an RDD (which has 16 byte keys) and write it to disk. If I open a python
> shell and run the following:
>
> for part in range(29):
>     print map(ord,
>         open('/home/ehiggs/data/terasort_out/part-r-000{0:02}'.format(part),
>              'r').read(16))
>
> Then the partitions are in order, judging by the first key of each partition.
>
> I can also call TeraValidate.validate from TeraSort and it is happy with
> the results. It seems to be on loading the file that the reordering
> happens. If this is expected, is there a way to ask Spark nicely to give me
> the RDD in the order it was saved?
>
> This is based on trying to fix my TeraValidate code on this branch:
> https://github.com/ehiggs/spark/tree/terasort
>
> Thanks,
> Ewan
>