Re: RDD order guarantees

Ewan Higgs Mon, 19 Jan 2015 07:06:03 -0800

Hi Reynold.
I'll take a look.

SPARK-5300 is open for this issue.
-Ewan


On 19/01/15 08:39, Reynold Xin wrote:

Hi Ewan,

Not sure if there is a JIRA ticket (there are so many that I lose track).

I chatted briefly with Aaron on this. The way we can solve it is to create a new FileSystem implementation that overrides the listStatus method, and then set fs.file.impl in the Hadoop conf to that implementation.


Shouldn't be too hard. Would you be interested in working on it?

On Fri, Jan 16, 2015 at 3:36 PM, Ewan Higgs wrote:


    Yes, I am running on a local file system.

    Is there a bug open for this? Mingyu Kim reported the problem last
    April:
    
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html

    -Ewan


    On 01/16/2015 07:41 PM, Reynold Xin wrote:

    You are running on a local file system, right? HDFS orders the
    files by name, but local file systems often don't. I think
    that's the reason for the difference.

    We might be able to sort and order the partitions when we
    create an RDD to make this universal, though.
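
    The sorting idea can be sketched outside Spark in plain Python. This
    is only an illustration, not Spark's implementation: the helper name
    ordered_part_files and the file-name pattern are assumptions made for
    this example. It lists a directory (whose order depends on the OS and
    file system) and returns the part files sorted by partition index.

```python
import os
import re

# Assumed naming scheme for output files, e.g. part-00003 or part-r-00003.
PART_RE = re.compile(r"^part-(?:r-)?(\d+)$")


def ordered_part_files(directory):
    """Return paths of part files in `directory`, sorted by partition index.

    Hypothetical helper for illustration; it is not part of Spark or Hadoop.
    """
    parts = []
    # os.listdir makes no ordering guarantee, much like a local
    # file system listing.
    for name in os.listdir(directory):
        match = PART_RE.match(name)
        if match:  # skip non-part files such as _SUCCESS
            parts.append((int(match.group(1)), os.path.join(directory, name)))
    # Sort by the numeric partition index, then drop the index.
    return [path for _, path in sorted(parts)]
```

    Sorting on the parsed index rather than the raw name keeps the order
    correct even if the indices are not zero-padded to the same width.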

    On Fri, Jan 16, 2015 at 8:26 AM, Ewan Higgs wrote:

        Hi all,
        Quick one: when reading files, is the order of partitions
        guaranteed to be preserved? I am finding some weird behaviour
        where I run sortByKey() on an RDD (which has 16-byte keys)
        and write it to disk. If I open a python shell and run the
        following:

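        # Print the first 16-byte key of each output partition as integers.
        for part in range(29):
            path = '/home/ehiggs/data/terasort_out/part-r-000{0:02}'.format(part)
            with open(path, 'rb') as f:
                print(list(bytearray(f.read(16))))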

        Then the partitions are in order, judging by the first key
        of each partition.

        I can also call TeraValidate.validate from TeraSort and it is
        happy with the results. It seems to be on loading the file
        that the reordering happens. If this is expected, is there a
        way to ask Spark nicely to give me the RDD in the order it
        was saved?

        This is based on trying to fix my TeraValidate code on this
        branch:
        https://github.com/ehiggs/spark/tree/terasort

        Thanks,
        Ewan
