[ https://issues.apache.org/jira/browse/MAPREDUCE-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13911155#comment-13911155 ]
Vinod Kumar Vavilapalli commented on MAPREDUCE-5756:
----------------------------------------------------

"git blame" tells me this was introduced by MAPREDUCE-1981, which was committed to 0.23.10 and 2.1.1-beta.

This is the interesting bit of that patch:
{code}
@@ -169,13 +171,17 @@ public static PathFilter getInputPathFilter(JobConf conf) {
   protected void addInputPathRecursively(List<FileStatus> result,
       FileSystem fs, Path path, PathFilter inputFilter)
       throws IOException {
-    for(FileStatus stat: fs.listStatus(path, inputFilter)) {
-      if (stat.isDirectory()) {
-        addInputPathRecursively(result, fs, stat.getPath(), inputFilter);
-      } else {
-        result.add(stat);
+    RemoteIterator<LocatedFileStatus> iter = fs.listLocatedStatus(path);
+    while (iter.hasNext()) {
+      LocatedFileStatus stat = iter.next();
+      if (inputFilter.accept(stat.getPath())) {
+        if (stat.isDirectory()) {
+          addInputPathRecursively(result, fs, stat.getPath(), inputFilter);
+        } else {
+          result.add(stat);
+        }
       }
-    }
+    }
   }
{code}

Clearly, before 0.23.10 and 2.1.1-beta, the behavior was to exclude directories. So should we treat it as incorrect behavior and fix it?

> CombineFileInputFormat.getSplits() including directories in its results
> -----------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5756
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5756
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Jason Dere
>
> Trying to track down HIVE-6401, where we see some "is not a file" errors
> because getSplits() is giving us directories. I believe the culprit is
> FileInputFormat.listStatus():
> {code}
>         if (recursive && stat.isDirectory()) {
>           addInputPathRecursively(result, fs, stat.getPath(),
>               inputFilter);
>         } else {
>           result.add(stat);
>         }
> {code}
> Which seems to be allowing directories to be added to the results if
> recursive is false. Is this meant to return directories? If not, I think it
> should look like this:
> {code}
>         if (stat.isDirectory()) {
>           if (recursive) {
>             addInputPathRecursively(result, fs, stat.getPath(),
>                 inputFilter);
>           }
>         } else {
>           result.add(stat);
>         }
> {code}
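
For anyone who wants to try the proposed branching without a Hadoop cluster, here is a minimal sketch of the same logic. It is not Hadoop code: it assumes java.nio.file as a stand-in for FileSystem/FileStatus, and the class name ListStatusSketch and the method names listFiles/addFiles are hypothetical, chosen only for illustration. The point it shows is that a directory is either descended into (when recursive is true) or skipped, and is never added to the result.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical stand-in for the listing logic discussed above; not part of Hadoop.
public class ListStatusSketch {

  /** Returns the plain files under dir that pass the filter; never returns directories. */
  public static List<Path> listFiles(Path dir, Predicate<Path> filter, boolean recursive)
      throws IOException {
    List<Path> result = new ArrayList<>();
    addFiles(result, dir, filter, recursive);
    return result;
  }

  private static void addFiles(List<Path> result, Path dir, Predicate<Path> filter,
      boolean recursive) throws IOException {
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
      for (Path entry : entries) {
        if (!filter.test(entry)) {
          continue; // rejected by the filter, whether file or directory
        }
        if (Files.isDirectory(entry)) {
          if (recursive) {
            // Descend only when asked to; otherwise the directory is dropped.
            addFiles(result, entry, filter, true);
          }
        } else {
          result.add(entry);
        }
      }
    }
  }
}
```

With a directory containing one file and one subdirectory (itself holding one file), a non-recursive call should return only the top-level file, and a recursive call should return both files.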