On Tue, 7 Apr 2020 at 15:26, Antonin Delpeuch <li...@antonin.delpeuch.eu> wrote:
> Hi,
>
> Sorry to dig out this thread, but this bug is still present.
>
> The fix proposed in this thread (creating a new FileSystem implementation
> which sorts listed files) was rejected, with the suggestion that it is the
> FileInputFormat's responsibility to sort the file names if preserving
> partition order is desired:
> https://github.com/apache/spark/pull/4204
>
> Given that Spark RDDs are supposed to preserve the order of the collections
> they represent, this would still deserve to be fixed in Spark, I think. As a
> user, I expect that if I use saveAsTextFile and then load the resulting file
> with sparkContext.textFile, I obtain a dataset in the same order.
>
> Because Spark uses the FileInputFormats exposed by Hadoop, that would mean
> either patching Hadoop for it to sort file names directly (which is likely
> going to fail since Hadoop might not care about the ordering in general),

I don't see any guarantees in Hadoop about the order of listLocatedStatus; for the local FS you get whatever the OS gives you. What isn't easy is taking an entire listing and sorting it, not when it is potentially millions of entries. That issue is why the newer FS list APIs all return a RemoteIterator<>: incremental paging of values, which reduces the payload of individual RPC messages between the HDFS client and the namenode, and allows paged/incremental listings against object stores. You can't provide incremental pages of results *and* sort those results at the same time (there's a small sketch at the end of this mail showing why the two goals conflict). Which, given those APIs are my problem, means I wouldn't be happy with adding "sort all listings" as a new restriction on FS semantics.

> or
> create subclasses of all Hadoop formats used in Spark, adding the required
> sorting to the listStatus method. This strikes me as less elegant than
> implementing a new FileSystem as suggested by Reynold, though.

Again, you've got some scale issues to deal with, but as FileInputFormat builds a list it's already in trouble if you point it at a sufficiently large directory tree. The best thing to do would be to add entries to a TreeMap during the recursive treewalk and then serve them up in order from there, so there's no need to do a sort at the end. But trying to subclass all Hadoop formats is itself troublesome. If you go that way: make it an optional interface, and/or talk to the mapreduce project about actually providing a base implementation. (A rough sketch of such a subclass is appended at the end of this mail.)

> Another way to "fix" this would be to mention in the docs that order is not
> preserved in this scenario, which could hopefully avoid bad surprises to
> others (just like we already have a caveat about nondeterminism of order
> after shuffles).
>
> I would be happy to try submitting a fix for this, if there is a consensus
> around the correct course of action.

Even if it's not the final desired goal, it's a correct description of the current state of the application ...
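
To make the paging point concrete, here is a minimal sketch of consuming a recursive listing through the RemoteIterator API; the /data path and the class name are made up, the listFiles/RemoteIterator calls are the real ones. The only way to impose a total order is to buffer every entry (e.g. in a TreeMap) before the "first" one can be handed out, which is exactly what the incremental iterator is there to avoid:

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListingSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path root = new Path("/data");              // made-up path, for illustration
    FileSystem fs = root.getFileSystem(conf);

    // Recursive listing: entries arrive in pages, in whatever order the
    // underlying store returns them.
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);

    // Imposing a total order means buffering the entire listing in memory
    // before anything can be served up -- which defeats the point of the
    // incremental iterator on a very large directory tree.
    TreeMap<String, LocatedFileStatus> ordered = new TreeMap<>();
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      ordered.put(status.getPath().toString(), status);
    }
    ordered.keySet().forEach(System.out::println);
  }
}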
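
And for the subclassing route, roughly what one of those wrappers could look like against the old mapred TextInputFormat (which is, as far as I remember, what sparkContext.textFile uses). The class name is invented, and you'd need the same treatment for every other format Spark exposes:

import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Invented class name, for illustration only.
public class SortedTextInputFormat extends TextInputFormat {

  @Override
  protected FileStatus[] listStatus(JobConf job) throws IOException {
    // Let the base class do the (possibly recursive) treewalk, then impose
    // a total order by path. Note this still materializes the full listing
    // in memory -- but FileInputFormat has already done that by this point.
    FileStatus[] files = super.listStatus(job);
    Arrays.sort(files, Comparator.comparing(f -> f.getPath().toString()));
    return files;
  }
}

The nicer version would sort during the walk itself (the TreeMap idea above) rather than after it, but that means changing FileInputFormat, which is why I'd raise it with the mapreduce project rather than only patching things on the Spark side.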