Just some thoughts; I hope I didn't miss something obvious. HadoopFSRelation calls the FileSystem class directly to list the files in a path. It looks like it implements basically the same logic as the FileInputFormat.listStatus method (located in hadoop-mapreduce-client-core).
The point is that HadoopRDD (or similar) calls the getSplits method, which calls FileInputFormat.listStatus, while HadoopFSRelation calls FileSystem directly. Both are trying to achieve the same thing: a "listing" of objects.

This duplication can cause various issues. For example, https://issues.apache.org/jira/browse/SPARK-7868 makes sure that "_temporary" is not returned in the result, but the listing in FileInputFormat contains more logic. It uses a hidden-file PathFilter like this:

    private static final PathFilter hiddenFileFilter = new PathFilter() {
        public boolean accept(Path p) {
            String name = p.getName();
            return !name.startsWith("_") && !name.startsWith(".");
        }
    };

In addition, a custom FileOutputCommitter may use a name other than "_temporary". As a result, HadoopFSRelation and HadoopRDD may return different lists for the same data source.

My question is: what is the roadmap for this listing in HadoopFSRelation? Will it implement exactly the same logic as FileInputFormat.listStatus, or might HadoopFSRelation one day call FileInputFormat.listStatus and provide a custom PathFilter or MultiPathFilter? That way there would be a single piece of code that lists objects.

Thanks,
Gil.
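
P.S. To make the concern concrete, here is a minimal sketch of how the two filtering rules can diverge on the same directory. It uses plain file-name strings instead of Hadoop Path objects so that it runs without Hadoop on the classpath. The class name ListingFilterSketch, the TEMPORARY_ONLY_FILTER rule, and the sample names (including "_my_committer_tmp", standing in for a custom committer's working directory) are my own assumptions for illustration; the real HadoopFSRelation listing may well do more than skip "_temporary".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class ListingFilterSketch {
    // Same rule as FileInputFormat's hiddenFileFilter, applied to a bare file name.
    static final Predicate<String> HIDDEN_FILE_FILTER =
        name -> !name.startsWith("_") && !name.startsWith(".");

    // Hypothetical, simplified stand-in for a listing that only skips "_temporary".
    static final Predicate<String> TEMPORARY_ONLY_FILTER =
        name -> !name.equals("_temporary");

    // Returns the names accepted by the given filter, preserving input order.
    static List<String> list(List<String> names, Predicate<String> filter) {
        List<String> accepted = new ArrayList<>();
        for (String name : names) {
            if (filter.test(name)) {
                accepted.add(name);
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        // Made-up directory contents; "_my_committer_tmp" imitates a custom committer.
        List<String> names = Arrays.asList(
            "part-00000", "_SUCCESS", "_temporary", ".part-00000.crc", "_my_committer_tmp");

        // prints [part-00000, _SUCCESS, .part-00000.crc, _my_committer_tmp]
        System.out.println(list(names, TEMPORARY_ONLY_FILTER));
        // prints [part-00000]
        System.out.println(list(names, HIDDEN_FILE_FILTER));
    }
}
```

With these sample names, the "_temporary"-only rule still returns the marker file, the checksum file, and the custom committer directory, while the hidden-file rule returns only the data file. That is the kind of mismatch I am worried about.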