Just some thoughts; I hope I didn't miss something obvious.

HadoopFSRelation calls the FileSystem class directly to list the files 
in a path.
It looks like it implements basically the same logic as the 
FileInputFormat.listStatus method (located in 
hadoop-mapreduce-client-core).

The point is that HadoopRDD (or similar) calls the getSplits method, which 
calls FileInputFormat.listStatus, while HadoopFSRelation calls FileSystem 
directly; both of them are trying to list the same objects.

This can cause various issues. For example, 
https://issues.apache.org/jira/browse/SPARK-7868 makes sure that 
"_temporary" is not returned in the result, but the listing in 
FileInputFormat contains more logic: it uses a hidden-file PathFilter 
like this:

  private static final PathFilter hiddenFileFilter = new PathFilter() {
    public boolean accept(Path p) {
      String name = p.getName();
      return !name.startsWith("_") && !name.startsWith(".");
    }
  };

In addition, a custom FileOutputCommitter may use a name other than 
"_temporary".

All of this may lead to HadoopFSRelation and HadoopRDD returning 
different file lists for the same data source.
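To make the mismatch concrete, here is a minimal sketch that uses plain 
strings instead of Hadoop Path objects, so it runs without Hadoop on the 
classpath. The first filter copies the hiddenFileFilter rule quoted above. 
The second filter, temporaryOnlyFilter, is only my assumption of what a 
listing that skips nothing but "_temporary" would look like; it is not 
taken from the HadoopFSRelation source. The file names are hypothetical.

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Illustration only: plain strings stand in for Hadoop Path names.
class ListingMismatch {

    // Same rule as the hiddenFileFilter quoted from FileInputFormat.
    static boolean hiddenFileFilter(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    // Assumption: a listing that skips only the "_temporary" directory.
    static boolean temporaryOnlyFilter(String name) {
        return !name.equals("_temporary");
    }

    // Returns the names accepted by the given filter, in input order.
    static List<String> keep(List<String> names, Predicate<String> filter) {
        return names.stream().filter(filter).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical contents of an output directory.
        List<String> names =
            List.of("part-00000", "_SUCCESS", "_temporary", ".part-00000.crc");

        System.out.println(keep(names, ListingMismatch::hiddenFileFilter));
        System.out.println(keep(names, ListingMismatch::temporaryOnlyFilter));
    }
}
```

With these names the first listing contains only part-00000, while the 
second also returns _SUCCESS and .part-00000.crc.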

My question is: what is the roadmap for this listing in HadoopFSRelation? 
Will it implement exactly the same logic as 
FileInputFormat.listStatus, or might HadoopFSRelation one day call 
FileInputFormat.listStatus and provide a custom PathFilter or 
MultiPathFilter? That way there would be a single piece of code that 
lists objects.
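As a rough sketch of the "single code path" idea: one listing routine 
always applies the hidden-file rule and then any caller-supplied filters, 
accepting a name only if every filter accepts it (which, as far as I 
understand, is how MultiPathFilter combines filters). Again this uses 
plain strings rather than Hadoop types, and SharedListing, allOf, list 
and NO_TMP are hypothetical names, not existing Hadoop or Spark APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Illustration only: one shared listing routine with pluggable filters.
class SharedListing {

    // Same rule as the hiddenFileFilter quoted from FileInputFormat.
    static final Predicate<String> HIDDEN_FILE_FILTER =
        name -> !name.startsWith("_") && !name.startsWith(".");

    // Hypothetical custom filter, e.g. for a committer that writes "*.tmp".
    static final Predicate<String> NO_TMP = name -> !name.endsWith(".tmp");

    // Accepts a name only if every filter in the list accepts it.
    static Predicate<String> allOf(List<Predicate<String>> filters) {
        return name -> {
            for (Predicate<String> f : filters) {
                if (!f.test(name)) {
                    return false;
                }
            }
            return true;
        };
    }

    // Single entry point: hidden-file rule first, then the custom filters.
    static List<String> list(List<String> names,
                             List<Predicate<String>> custom) {
        List<Predicate<String>> filters = new ArrayList<>();
        filters.add(HIDDEN_FILE_FILTER);
        filters.addAll(custom);
        Predicate<String> combined = allOf(filters);

        List<String> result = new ArrayList<>();
        for (String name : names) {
            if (combined.test(name)) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical directory contents.
        List<String> names =
            List.of("part-00000", "_temporary", "part-00001.tmp");
        System.out.println(list(names, List.of(NO_TMP)));
    }
}
```

If both HadoopRDD and HadoopFSRelation went through something like 
list(...), they could differ only in the custom filters they pass in.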

Thanks,
Gil.

