Re: possible issues with listing objects in the HadoopFSrelation

Cheng Lian Wed, 12 Aug 2015 00:51:41 -0700

Hi Gil,

Sorry for the late reply, and thanks for raising this question. The file listing logic in HadoopFsRelation is intentionally different from Hadoop's FileInputFormat. Here are the reasons:

1. Efficiency: when computing RDD partitions, FileInputFormat.listStatus() is called on the driver side in a sequential manner, and can be slow for S3 directories with lots of sub-directories, e.g. partitioned tables with thousands of partitions or more. This is partly because file metadata operations can be very slow on S3. HadoopFsRelation relies on this file listing action to do partition discovery, and we've made a distributed parallel version in Spark 1.5: we first list input paths on the driver side in a sequential breadth-first manner, and once the number of directories to be listed exceeds a threshold (32 by default), we launch a Spark job to do the file listing. With this mechanism, we've observed a performance boost of two orders of magnitude when reading partitioned tables with thousands of distinct partitions located on S3.

2. Semantic difference: the default hiddenFileFilter doesn't apply in every case. For example, the Parquet summary files _metadata and _common_metadata play crucial roles in schema discovery and schema merging, and we don't want to exclude them when listing the files. They are, however, removed when reading the actual data. That said, we probably should allow users to pass in user-defined path filters.
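
The listing strategy in point 1 can be sketched roughly as follows. This is only an illustration, not the Spark implementation: the local filesystem (java.nio.file) stands in for the Hadoop FileSystem, a parallel stream stands in for the Spark job, and the class and method names (ListingSketch, listFiles, and so on) are hypothetical. Only the threshold of 32 comes from the description above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ListingSketch {
    // Default threshold mentioned above; beyond it, listing goes parallel.
    static final int PARALLEL_THRESHOLD = 32;

    // Lists the direct children of a single directory.
    static List<Path> listChildren(Path dir) {
        try (Stream<Path> children = Files.list(dir)) {
            return children.collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Recursively lists every non-directory entry under a directory.
    static List<Path> listLeafFiles(Path dir) {
        try (Stream<Path> walk = Files.walk(dir)) {
            return walk.filter(p -> !Files.isDirectory(p)).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Breadth-first listing; assumes every root is a directory.
    static List<Path> listFiles(List<Path> roots, int threshold) {
        List<Path> files = new ArrayList<>();
        List<Path> frontier = new ArrayList<>(roots);
        while (!frontier.isEmpty()) {
            if (frontier.size() > threshold) {
                // Too many directories for one thread: list them all in
                // parallel (stand-in for launching a Spark job).
                files.addAll(frontier.parallelStream()
                        .flatMap(dir -> listLeafFiles(dir).stream())
                        .collect(Collectors.toList()));
                break;
            }
            // Otherwise, list one level sequentially on the "driver".
            List<Path> next = new ArrayList<>();
            for (Path dir : frontier) {
                for (Path child : listChildren(dir)) {
                    if (Files.isDirectory(child)) {
                        next.add(child);
                    } else {
                        files.add(child);
                    }
                }
            }
            frontier = next;
        }
        return files;
    }
}
```

Calling ListingSketch.listFiles(roots, ListingSketch.PARALLEL_THRESHOLD) on a small table stays entirely sequential; a table whose partition directories outnumber the threshold switches to the parallel branch as soon as a level of the tree is wide enough.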

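The two-phase treatment of Parquet summary files in point 2 can be illustrated like this. It is a sketch under assumptions: the rule for hidden names is the one from FileInputFormat's hiddenFileFilter, but the way the listing filter combines it with the summary file names is hypothetical, as are all class and field names. It works on plain file names rather than Hadoop Path objects so that it runs without Hadoop on the classpath.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class SummaryFileFilterSketch {
    // Parquet summary files needed for schema discovery and merging.
    static final List<String> SUMMARY_FILES =
            Arrays.asList("_metadata", "_common_metadata");

    // Same rule as FileInputFormat's hiddenFileFilter, applied to names.
    static final Predicate<String> NOT_HIDDEN =
            name -> !name.startsWith("_") && !name.startsWith(".");

    // Listing phase (hypothetical): keep summary files even though
    // their names start with an underscore.
    static final Predicate<String> LISTING_FILTER =
            name -> SUMMARY_FILES.contains(name) || NOT_HIDDEN.test(name);

    // Data phase: summary files carry no rows, so drop them.
    static final Predicate<String> DATA_FILTER =
            name -> !SUMMARY_FILES.contains(name);

    // Returns the names accepted by the filter, in input order.
    static List<String> select(List<String> names, Predicate<String> filter) {
        return names.stream().filter(filter).collect(Collectors.toList());
    }
}
```

Applying LISTING_FILTER first and DATA_FILTER to its result gives the listed files and then the files that are actually read as data.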

Cheng

On 8/10/15 7:55 PM, Gil Vernik wrote:

Just some thoughts; I hope I didn't miss something obvious.
HadoopFsRelation calls the FileSystem class directly to list files in the path. It looks like it implements basically the same logic as the FileInputFormat.listStatus method (located in hadoop-mapreduce-client-core).
The point is that HadoopRDD (or similar) calls the getSplits method, which calls FileInputFormat.listStatus, while HadoopFsRelation calls FileSystem directly; both of them try to achieve the same "listing" of objects.
There might be various issues with this. For example, https://issues.apache.org/jira/browse/SPARK-7868 makes sure that "_temporary" is not returned in the result, but the listing in FileInputFormat contains more logic: it uses a hidden PathFilter like this:
    private static final PathFilter hiddenFileFilter = new PathFilter() {
      public boolean accept(Path p) {
        String name = p.getName();
        return !name.startsWith("_") && !name.startsWith(".");
      }
    };
In addition, a custom FileOutputCommitter may use a name other than "_temporary".
All this may lead to HadoopFsRelation and HadoopRDD producing different lists from the same data source.
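
To illustrate the concern, here is a minimal sketch that compares the two rules on the same set of names. The first predicate follows the hiddenFileFilter quoted above; the second is a simplified assumption of a listing that only excludes "_temporary", and "_staging" is a made-up directory name standing in for a custom committer. Class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ListingDivergenceSketch {
    // Rule used by FileInputFormat's hiddenFileFilter.
    static boolean hiddenFileFilter(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Simplified assumption: a listing that only skips "_temporary".
    static boolean temporaryOnlyFilter(String name) {
        return !name.equals("_temporary");
    }

    // Names on which the two rules disagree.
    static List<String> acceptedByOnlyOne(List<String> names) {
        List<String> out = new ArrayList<>();
        for (String name : names) {
            if (hiddenFileFilter(name) != temporaryOnlyFilter(name)) {
                out.add(name);
            }
        }
        return out;
    }
}
```

Any name returned by acceptedByOnlyOne would show up in one listing but not in the other.
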
My question is: what is the roadmap for this listing in HadoopFsRelation? Will it implement exactly the same logic as FileInputFormat.listStatus, or maybe one day HadoopFsRelation will call FileInputFormat.listStatus and provide a custom PathFilter or MultiPathFilter? This way there would be a single piece of code that lists objects.
Thanks,
Gil.
