On Tue, 7 Apr 2020 at 15:26, Antonin Delpeuch <li...@antonin.delpeuch.eu> wrote:
> Hi,
>
> Sorry to dig out this thread, but this bug is still present.
>
> The fix proposed in this thread (creating a new FileSystem implementation
> which sorts listed files) was rejected, with the suggestion that it is the
> FileInputFormat's responsibility to sort the file names if preserving
> partition order is desired:
> https://github.com/apache/spark/pull/4204
>
> Given that Spark RDDs are supposed to preserve the order of the collections
> they represent, this would still deserve to be fixed in Spark, I think. As a
> user, I expect that if I use saveAsTextFile and then load the resulting file
> with sparkContext.textFile, I obtain a dataset in the same order.
>
> Because Spark uses the FileInputFormats exposed by Hadoop, that would mean
> either patching Hadoop for it to sort file names directly (which is likely
> going to fail since Hadoop might not care about the ordering in general),

I don't see any guarantees in Hadoop about the order of listLocatedStatus; for the local FS you get whatever the OS gives you. What isn't easy is taking an entire listing and sorting it, not when it is potentially millions of entries. That issue is why the newer FS list APIs all return a RemoteIterator<>: incremental paging of values, which reduces the payload of individual RPC messages between the HDFS client and the namenode, and allows paged/incremental listings against object stores. You can't provide incremental pages of results *and* sort those results at the same time (there's a small sketch at the end of this mail showing why the two goals conflict). Which, given those APIs are my problem, means I wouldn't be happy with adding "sort all listings" as a new restriction on FS semantics.

> or
> create subclasses of all Hadoop formats used in Spark, adding the required
> sorting to the listStatus method. This strikes me as less elegant than
> implementing a new FileSystem as suggested by Reynold, though.

Again, you've got some scale issues to deal with, but as FileInputFormat builds a list it's already in trouble if you point it at a sufficiently large directory tree. The best thing to do would be to add entries to a TreeMap during the recursive treewalk and then serve them up in order from there, so there's no need to do a sort at the end. But trying to subclass all Hadoop formats is itself troublesome. If you go that way: make it an optional interface, and/or talk to the mapreduce project about actually providing a base implementation. (A rough sketch of such a subclass is appended at the end of this mail.)

> Another way to "fix" this would be to mention in the docs that order is not
> preserved in this scenario, which could hopefully avoid bad surprises to
> others (just like we already have a caveat about nondeterminism of order
> after shuffles).
>
> I would be happy to try submitting a fix for this, if there is a consensus
> around the correct course of action.

Even if it's not the final desired goal, it's a correct description of the current state of the application ...
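
To make the paging point concrete, here is a minimal sketch of consuming a recursive listing through the RemoteIterator API; the /data path and the class name are made up, the listFiles/RemoteIterator calls are the real ones. The only way to impose a total order is to buffer every entry (e.g. in a TreeMap) before the "first" one can be handed out, which is exactly what the incremental iterator is there to avoid:

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ListingSketch {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Path root = new Path("/data");              // made-up path, for illustration
    FileSystem fs = root.getFileSystem(conf);

    // Recursive listing: entries arrive in pages, in whatever order the
    // underlying store returns them.
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(root, true);

    // Imposing a total order means buffering the entire listing in memory
    // before anything can be served up -- which defeats the point of the
    // incremental iterator on a very large directory tree.
    TreeMap<String, LocatedFileStatus> ordered = new TreeMap<>();
    while (it.hasNext()) {
      LocatedFileStatus status = it.next();
      ordered.put(status.getPath().toString(), status);
    }
    ordered.keySet().forEach(System.out::println);
  }
}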
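
And for the subclassing route, roughly what one of those wrappers could look like against the old mapred TextInputFormat (which is, as far as I remember, what sparkContext.textFile uses). The class name is invented, and you'd need the same treatment for every other format Spark exposes:

import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Invented class name, for illustration only.
public class SortedTextInputFormat extends TextInputFormat {

  @Override
  protected FileStatus[] listStatus(JobConf job) throws IOException {
    // Let the base class do the (possibly recursive) treewalk, then impose
    // a total order by path. Note this still materializes the full listing
    // in memory -- but FileInputFormat has already done that by this point.
    FileStatus[] files = super.listStatus(job);
    Arrays.sort(files, Comparator.comparing(f -> f.getPath().toString()));
    return files;
  }
}

The nicer version would sort during the walk itself (the TreeMap idea above) rather than after it, but that means changing FileInputFormat, which is why I'd raise it with the mapreduce project rather than only patching things on the Spark side.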