[
https://issues.apache.org/jira/browse/MAPREDUCE-5756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13900772#comment-13900772
]
Jason Dere commented on MAPREDUCE-5756:
---------------------------------------
Looking a little more at this, FileInputFormat.listStatus() returns the same
results on hadoop-1 and hadoop-2, directories included, so I guess listStatus()
is not the issue. What differs is how CombineFileInputFormat.getSplits() handles
the file list after getting it: on hadoop-2 those directories end up in the
list of InputSplits:
(Hadoop 20S means hadoop 1.x)
{noformat}
2014-02-13 13:35:32,492 ERROR shims.HadoopShimsSecure (HadoopShimsSecure.java:getSplits(345)) - ** Hadoop version: 0.20S
2014-02-13 13:35:32,492 ERROR shims.HadoopShimsSecure (HadoopShimsSecure.java:getSplits(349)) - ** called super.getSplits():
[Paths:/000000_0:0+50 Locations:127.0.0.1:; ]
{noformat}
(Hadoop 23 means hadoop 2.x)
{noformat}
2014-02-13 13:38:12,425 ERROR shims.HadoopShimsSecure (HadoopShimsSecure.java:getSplits(345)) - ** Hadoop version: 0.23
2014-02-13 13:38:12,425 ERROR shims.HadoopShimsSecure (HadoopShimsSecure.java:getSplits(349)) - ** called super.getSplits():
[Paths:/000000_0:0+50 Locations:127.0.0.1:; ,
Paths:/Users:0+0,/build:0+0,/tmp:0+0,/user:0+0 Locations:; ]
{noformat}
> FileInputFormat.listStatus() including directories in its results
> -----------------------------------------------------------------
>
> Key: MAPREDUCE-5756
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-5756
> Project: Hadoop Map/Reduce
> Issue Type: Bug
> Reporter: Jason Dere
>
> Trying to track down HIVE-6401, where we see some "is not a file" errors
> because getSplits() is giving us directories. I believe the culprit is
> FileInputFormat.listStatus():
> {code}
> if (recursive && stat.isDirectory()) {
>   addInputPathRecursively(result, fs, stat.getPath(),
>       inputFilter);
> } else {
>   result.add(stat);
> }
> {code}
> This seems to allow directories to be added to the results when recursive is
> false. Is this meant to return directories? If not, I think it should look
> like this:
> {code}
> if (stat.isDirectory()) {
>   if (recursive) {
>     addInputPathRecursively(result, fs, stat.getPath(),
>         inputFilter);
>   }
> } else {
>   result.add(stat);
> }
> {code}
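
To make the difference between the two quoted branches concrete, here is a minimal, self-contained sketch. It does not use Hadoop at all: the Entry class, the listAsQuoted/listAsProposed methods, and the listFilesRecursively helper are hypothetical stand-ins for FileStatus, the two versions of the branch, and addInputPathRecursively(), which is assumed here to collect only files. It only illustrates the control flow, not the actual FileInputFormat implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model of the listStatus() branch; this is not Hadoop code.
class ListStatusSketch {

    /** Simplified stand-in for a FileStatus entry (illustrative only). */
    static final class Entry {
        final String path;
        final boolean directory;
        final List<Entry> children;

        Entry(String path, boolean directory, List<Entry> children) {
            this.path = path;
            this.directory = directory;
            this.children = children;
        }

        static Entry file(String path) {
            return new Entry(path, false, Collections.<Entry>emptyList());
        }

        static Entry dir(String path, Entry... children) {
            return new Entry(path, true, Arrays.asList(children));
        }
    }

    /** Branch as quoted in the issue: a directory is added when recursive is false. */
    static List<String> listAsQuoted(List<Entry> entries, boolean recursive) {
        List<String> result = new ArrayList<>();
        for (Entry stat : entries) {
            if (recursive && stat.directory) {
                result.addAll(listFilesRecursively(stat));
            } else {
                result.add(stat.path);
            }
        }
        return result;
    }

    /** Branch as proposed in the issue: a directory itself is never added. */
    static List<String> listAsProposed(List<Entry> entries, boolean recursive) {
        List<String> result = new ArrayList<>();
        for (Entry stat : entries) {
            if (stat.directory) {
                if (recursive) {
                    result.addAll(listFilesRecursively(stat));
                }
            } else {
                result.add(stat.path);
            }
        }
        return result;
    }

    /** Assumed behaviour of the recursive helper: collect only files below dir. */
    static List<String> listFilesRecursively(Entry dir) {
        List<String> files = new ArrayList<>();
        for (Entry child : dir.children) {
            if (child.directory) {
                files.addAll(listFilesRecursively(child));
            } else {
                files.add(child.path);
            }
        }
        return files;
    }

    public static void main(String[] args) {
        List<Entry> input = Arrays.asList(
            Entry.file("/000000_0"),
            Entry.dir("/tmp", Entry.file("/tmp/part-0")));

        System.out.println(listAsQuoted(input, false));   // [/000000_0, /tmp]
        System.out.println(listAsProposed(input, false)); // [/000000_0]
        System.out.println(listAsProposed(input, true));  // [/000000_0, /tmp/part-0]
    }
}
```

With recursive set to false, only the quoted variant lets the /tmp directory through; with recursive set to true, both variants behave the same in this model.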