[ https://issues.apache.org/jira/browse/HADOOP-15403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16448602#comment-16448602 ]
Jason Lowe commented on HADOOP-15403:
-------------------------------------

bq. would a change in config be ok?

A change in the default value for a config is arguably the same thing as a code change that changes the default behavior from the perspective of a user. To be clear I'm not saying we can't ever change the default behavior, but we need to be careful about the ramifications. If we do, it needs to be marked as an incompatible change and have a corresponding release note that clearly explains the potential for silent data loss relative to the old behavior and what users can do to restore the old behavior.

Given the behavior for non-recursive has been this way for quite a long time, either users aren't running into this very often or they've set the value to recursive. That leads me to suggest adding the ability to ignore directories but _not_ make it the default. Then we don't have a backward incompatibility and the one Hive case you're trying can still work once the config is updated (or Hive can run the job with that setting automatically if it makes sense for that use case).

> FileInputFormat recursive=false fails instead of ignoring the directories.
> --------------------------------------------------------------------------
>
>                 Key: HADOOP-15403
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15403
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HADOOP-15403.patch
>
>
> We are trying to create a split in Hive that will only read files in a
> directory and not subdirectories.
> That fails with the below error.
> Given how this error comes about (two pieces of code interact, one explicitly
> adding directories to results without failing, and one failing on any
> directories in results), this seems like a bug.
> {noformat}
> Caused by: java.io.IOException: Not a file: file:/,...warehouse/simple_to_mm_text/delta_0000001_0000001_0000
>     at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:329) ~[hadoop-mapreduce-client-core-3.1.0.jar:?]
>     at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:553) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:754) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:203) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
> {noformat}
> This code, when recursion is disabled, adds directories to results
> {noformat}
> if (recursive && stat.isDirectory()) {
>   result.dirsNeedingRecursiveCalls.add(stat);
> } else {
>   result.locatedFileStatuses.add(stat);
> }
> {noformat}
> However the getSplits code after that computes the size like this
> {noformat}
> long totalSize = 0; // compute total size
> for (FileStatus file: files) { // check we have valid files
>   if (file.isDirectory()) {
>     throw new IOException("Not a file: "+ file.getPath());
>   }
>   totalSize +=
> {noformat}
> which would always fail combined with the above code.
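
For illustration only, here is a minimal, self-contained sketch of the opt-in approach suggested in the comment above: directories in a non-recursive listing are skipped only when a flag is explicitly enabled, and the long-standing "Not a file" failure remains the default. It does not use the Hadoop API and is not the attached patch. The class name DirHandlingSketch, the Entry type (a stand-in for FileStatus), and the ignoreDirs parameter are all hypothetical names introduced here.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DirHandlingSketch {

    /** Hypothetical stand-in for Hadoop's FileStatus; illustration only. */
    static final class Entry {
        final String path;
        final boolean directory;
        final long length;

        Entry(String path, boolean directory, long length) {
            this.path = path;
            this.directory = directory;
            this.length = length;
        }
    }

    /**
     * Mirrors the size loop quoted from getSplits, with one addition:
     * a hypothetical opt-in flag. When ignoreDirs is false (the default
     * a caller would pass), the existing failure is preserved.
     */
    static long totalSize(List<Entry> files, boolean ignoreDirs) throws IOException {
        long totalSize = 0;
        for (Entry file : files) {
            if (file.directory) {
                if (ignoreDirs) {
                    continue; // opt-in: skip the directory instead of failing
                }
                throw new IOException("Not a file: " + file.path);
            }
            totalSize += file.length;
        }
        return totalSize;
    }

    public static void main(String[] args) throws IOException {
        // A non-recursive listing that still contains a directory entry.
        List<Entry> listing = new ArrayList<>();
        listing.add(new Entry("/warehouse/t/file_a", false, 100));
        listing.add(new Entry("/warehouse/t/delta_dir", true, 0));
        listing.add(new Entry("/warehouse/t/file_b", false, 50));

        // Opt-in behavior: the directory is ignored.
        System.out.println(totalSize(listing, true)); // prints 150

        // Default behavior: unchanged, still fails on the directory.
        try {
            totalSize(listing, false);
        } catch (IOException e) {
            System.out.println(e.getMessage()); // prints Not a file: /warehouse/t/delta_dir
        }
    }
}
```

Keeping the flag off by default is what avoids the backward incompatibility: a job that fails today keeps failing loudly rather than silently reading less data, and a caller such as Hive would have to enable the behavior explicitly.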