[ https://issues.apache.org/jira/browse/HADOOP-15403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448602#comment-16448602 ]

Jason Lowe commented on HADOOP-15403:
-------------------------------------

bq. would a change in config be ok?

A change in the default value of a config is, from a user's perspective, 
effectively the same as a code change that alters the default behavior.

To be clear, I'm not saying we can't ever change the default behavior, but we 
need to be careful about the ramifications.  If we do, it needs to be marked as 
an incompatible change and have a corresponding release note that clearly 
explains the potential for silent data loss relative to the old behavior and 
what users can do to restore the old behavior.

Given that the non-recursive behavior has been this way for quite a long time, 
either users aren't running into this very often or they've already set the 
value to recursive.  That leads me to suggest adding the ability to ignore 
directories but _not_ making it the default.  Then we avoid a backward 
incompatibility, and the Hive case you're describing can still work once the 
config is updated (or Hive can set it automatically when submitting the job, if 
that makes sense for that use case).


> FileInputFormat recursive=false fails instead of ignoring the directories.
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-15403
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15403
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HADOOP-15403.patch
>
>
> We are trying to create a split in Hive that will only read files in a 
> directory and not subdirectories.
> That fails with the below error.
> Given how this error arises (two pieces of code interact: one explicitly adds 
> directories to the results without failing, and the other fails on any 
> directory in the results), this looks like a bug.
> {noformat}
> Caused by: java.io.IOException: Not a file: 
> file:/,...warehouse/simple_to_mm_text/delta_0000001_0000001_0000
>       at 
> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:329) 
> ~[hadoop-mapreduce-client-core-3.1.0.jar:?]
>       at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:553)
>  ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:754)
>  ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
>       at 
> org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:203)
>  ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
> {noformat}
> This code, when recursion is disabled, adds directories to the results:
> {noformat}
> if (recursive && stat.isDirectory()) {
>   result.dirsNeedingRecursiveCalls.add(stat);
> } else {
>   result.locatedFileStatuses.add(stat);
> }
> {noformat}
> However, the getSplits code after that computes the total size like this:
> {noformat}
> long totalSize = 0;                           // compute total size
> for (FileStatus file: files) {                // check we have valid files
>   if (file.isDirectory()) {
>     throw new IOException("Not a file: "+ file.getPath());
>   }
>   totalSize += file.getLen();
> }
> {noformat}
> which will always fail when combined with the above code.
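The interaction between the two fragments can be reproduced outside Hadoop with a small simulation; the Status class and method names below are stand-ins for Hadoop's FileStatus and the two code paths, not the actual Hadoop API:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SplitConflictDemo {

    // Minimal stand-in for Hadoop's FileStatus.
    static class Status {
        final String path;
        final boolean dir;
        final long len;
        Status(String path, boolean dir, long len) {
            this.path = path; this.dir = dir; this.len = len;
        }
    }

    // Mirrors the listing fragment: with recursive=false, directories fall
    // through to the else branch and join the results.
    static List<Status> list(List<Status> entries, boolean recursive) {
        List<Status> result = new ArrayList<>();
        for (Status s : entries) {
            if (recursive && s.dir) {
                // would be queued for a recursive listing call
            } else {
                result.add(s);
            }
        }
        return result;
    }

    // Mirrors the getSplits fragment: any directory in the list is fatal.
    static long totalSize(List<Status> files) throws IOException {
        long total = 0;
        for (Status f : files) {
            if (f.dir) {
                throw new IOException("Not a file: " + f.path);
            }
            total += f.len;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Status> entries = new ArrayList<>();
        entries.add(new Status("/warehouse/t/file1", false, 100));
        entries.add(new Status("/warehouse/t/delta_0001", true, 0));
        try {
            // recursive=false: the directory is kept, so totalSize must throw.
            totalSize(list(entries, false));
        } catch (IOException e) {
            System.out.println(e.getMessage()); // prints "Not a file: /warehouse/t/delta_0001"
        }
    }
}
```

With any directory present in the input, the non-recursive path can never reach the size computation, which is exactly the failure mode the report describes.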



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
