[ https://issues.apache.org/jira/browse/HADOOP-15403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448264#comment-16448264 ]

Jason Lowe commented on HADOOP-15403:
-------------------------------------

Does this have backward compatibility ramifications?  The default for 
mapreduce.input.fileinputformat.input.dir.recursive is false, so unless users 
changed it, jobs today fail if the input contains directories.  If we change 
the behavior to ignore directories, that could lead to silent data loss if a 
job tries to consume an input location that now suddenly contains some 
directories.

In short: is it OK to assume the users will be aware of and agree with the new 
behavior?  Is there any way for users to revert to the old behavior if they do 
not want any inputs to be silently ignored?
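
For illustration only, an opt-out along the lines of the sketch below would let cautious users keep the current failing behavior. The property name and helper method are hypothetical and are not part of the attached patch:
{noformat}
// Hypothetical sketch, not the attached HADOOP-15403.patch. The property name
// below does not exist; it only illustrates what an opt-out switch could look
// like. Uses org.apache.hadoop.mapred.JobConf and org.apache.hadoop.fs.FileStatus.
private static long sumFileSizes(JobConf job, FileStatus[] files)
    throws IOException {
  // Defaulting to true keeps today's behavior of failing on directories;
  // users who accept the new behavior could set it to false to skip them.
  boolean failOnDirectory = job.getBoolean(
      "mapreduce.input.fileinputformat.input.dir.nonrecursive.fail", true);

  long totalSize = 0;
  for (FileStatus file : files) {
    if (file.isDirectory()) {
      if (failOnDirectory) {
        throw new IOException("Not a file: " + file.getPath());
      }
      continue;  // silently ignore the directory
    }
    totalSize += file.getLen();
  }
  return totalSize;
}
{noformat}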

> FileInputFormat recursive=false fails instead of ignoring the directories.
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-15403
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15403
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HADOOP-15403.patch
>
>
> We are trying to create a split in Hive that will only read files in a 
> directory and not its subdirectories.
> That fails with the error below.
> Given how this error comes about (two pieces of code interact, one explicitly 
> adding directories to results without failing, and one failing on any 
> directories in results), this seems like a bug.
> {noformat}
> Caused by: java.io.IOException: Not a file: file:/,...warehouse/simple_to_mm_text/delta_0000001_0000001_0000
>       at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:329) ~[hadoop-mapreduce-client-core-3.1.0.jar:?]
>       at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:553) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
>       at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:754) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
>       at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:203) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
> {noformat}
> This code, when recursion is disabled, adds directories to results:
> {noformat}
> if (recursive && stat.isDirectory()) {
>   result.dirsNeedingRecursiveCalls.add(stat);
> } else {
>   result.locatedFileStatuses.add(stat);
> }
> {noformat}
> However, the getSplits code that runs afterwards computes the total size like this:
> {noformat}
> long totalSize = 0;                           // compute total size
> for (FileStatus file : files) {               // check we have valid files
>   if (file.isDirectory()) {
>     throw new IOException("Not a file: " + file.getPath());
>   }
>   totalSize += file.getLen();
> }
> {noformat}
> which will always fail when combined with the code above.
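> A minimal sketch (not the attached patch) of what ignoring directories, rather 
> than failing on them, could look like in that size computation:
> {noformat}
> long totalSize = 0;                           // compute total size
> for (FileStatus file : files) {
>   if (file.isDirectory()) {
>     continue;                                 // skip the directory instead of failing
>   }
>   totalSize += file.getLen();
> }
> {noformat}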



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
