[
https://issues.apache.org/jira/browse/HADOOP-10340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13899589#comment-13899589
]
Jason Lowe commented on HADOOP-10340:
-------------------------------------
Looking at the 1.x code, it appears it will also add directories to the results,
but somewhat inconsistently: it only adds them if they are not immediately
under the initial input path. From the FileInputFormat.listStatus() code:
{code}
FileStatus[] matches = fs.globStatus(p, inputFilter);
if (matches == null) {
  errors.add(new IOException("Input path does not exist: " + p));
} else if (matches.length == 0) {
  errors.add(new IOException("Input Pattern " + p + " matches 0 files"));
} else {
  for (FileStatus globStat: matches) {
    if (globStat.isDir()) {
      for(FileStatus stat: fs.listStatus(globStat.getPath(),
          inputFilter)) {
        result.add(stat);
      }
    } else {
      result.add(globStat);
    }
  }
}
{code}
Note how it blindly adds all the results of the second-level directory
listing to the results rather than recursing through the directory handling
logic. That inconsistent directory handling in 1.x seems like a bug to me.
However, note that it does not skip any directories: it either adds the
contents of the directory or the directory itself. I don't think it's OK to
skip the directory entirely when gathering the input, or we could easily and
silently drop input data for the job.
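
To make the difference between these behaviors concrete, here is a minimal, self-contained sketch. It models the file tree in memory instead of calling Hadoop, so everything in it (the Node class and the oneLevel, skipNested and recursive methods) is a hypothetical name for illustration, not a Hadoop API. It only approximates the logic discussed above: one-level expansion as in 1.x, one-level expansion that drops nested directories, and full recursion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ListingSketch {
    /** Hypothetical stand-in for FileStatus: a path, a directory flag, and children. */
    public static final class Node {
        final String path;
        final boolean isDir;
        final List<Node> children;

        private Node(String path, boolean isDir, List<Node> children) {
            this.path = path;
            this.isDir = isDir;
            this.children = children;
        }

        public static Node file(String path) {
            return new Node(path, false, new ArrayList<Node>());
        }

        public static Node dir(String path, Node... children) {
            return new Node(path, true, Arrays.asList(children));
        }
    }

    /** 1.x-style: expand each matched directory one level and add every child as-is. */
    public static List<String> oneLevel(List<Node> matches) {
        List<String> result = new ArrayList<>();
        for (Node match : matches) {
            if (match.isDir) {
                for (Node child : match.children) {
                    result.add(child.path); // nested directories are added too
                }
            } else {
                result.add(match.path);
            }
        }
        return result;
    }

    /** Skip variant: same one-level expansion, but nested directories are dropped. */
    public static List<String> skipNested(List<Node> matches) {
        List<String> result = new ArrayList<>();
        for (Node match : matches) {
            if (match.isDir) {
                for (Node child : match.children) {
                    if (!child.isDir) {
                        result.add(child.path); // files below a nested directory are lost
                    }
                }
            } else {
                result.add(match.path);
            }
        }
        return result;
    }

    /** Recursive variant: descend into every directory and add only files. */
    public static List<String> recursive(List<Node> matches) {
        List<String> result = new ArrayList<>();
        for (Node match : matches) {
            collect(match, result);
        }
        return result;
    }

    private static void collect(Node node, List<String> result) {
        if (node.isDir) {
            for (Node child : node.children) {
                collect(child, result);
            }
        } else {
            result.add(node.path);
        }
    }

    public static void main(String[] args) {
        Node input = Node.dir("/in",
            Node.file("/in/a.txt"),
            Node.dir("/in/sub", Node.file("/in/sub/b.txt")));
        List<Node> matches = Arrays.asList(input);
        System.out.println(oneLevel(matches));   // [/in/a.txt, /in/sub]
        System.out.println(skipNested(matches)); // [/in/a.txt]
        System.out.println(recursive(matches));  // [/in/a.txt, /in/sub/b.txt]
    }
}
```

In this toy tree, the skip variant never sees /in/sub/b.txt, which is the kind of silent input loss I am worried about. The one-level variant at least surfaces /in/sub, so a later "is not a file" error points at the problem.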
> FileInputFormat.listStatus() including directories in its results
> -----------------------------------------------------------------
>
> Key: HADOOP-10340
> URL: https://issues.apache.org/jira/browse/HADOOP-10340
> Project: Hadoop Common
> Issue Type: Bug
> Reporter: Jason Dere
>
> Trying to track down HIVE-6401, where we see some "is not a file" errors
> because getSplits() is giving us directories. I believe the culprit is
> FileInputFormat.listStatus():
> {code}
> if (recursive && stat.isDirectory()) {
>   addInputPathRecursively(result, fs, stat.getPath(),
>       inputFilter);
> } else {
>   result.add(stat);
> }
> {code}
> This seems to allow directories to be added to the results if
> recursive is false. Is this meant to return directories? If not, I think it
> should look like this:
> {code}
> if (stat.isDirectory()) {
>   if (recursive) {
>     addInputPathRecursively(result, fs, stat.getPath(),
>         inputFilter);
>   }
> } else {
>   result.add(stat);
> }
> {code}