[ https://issues.apache.org/jira/browse/HIVE-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13760410#comment-13760410 ]
Phabricator commented on HIVE-5102:
-----------------------------------

omalley has commented on the revision "HIVE-5102 [jira] ORC getSplits should create splits based the stripes".

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java:247 +1
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java:261 Yes, but you'd need to carefully consider the impact of globalizing the thread pool.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java:262 We could do either. The final return type is InputSplit[], so this seemed more natural.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java:288 I should comment, but I used this for testing. In the test code, it is useful to be able to refer to -1 to get the last split added.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java:329 This is pretty standard for cases where there isn't a natural timeout. The only way for the threads to not finish is if HDFS is hanging, which would cause the previous code to hang forever.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/ReaderImpl.java:276 To handle the case where the byte is between 0x80 and 0xff. Natural promotion would convert those to negative numbers, which isn't desired.
  ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestInputOutputFormat.java:285-288 This patch changes the desired behavior. In the previous version, the orc file with 0 rows still created a single input split, which is what this test was testing. With the modified test, we ensure that we get no splits, which is the new expected behavior.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java:69 +1

REVISION DETAIL
  https://reviews.facebook.net/D12579

BRANCH
  h-5102

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, omalley

> ORC getSplits should create splits based the stripes
> -----------------------------------------------------
>
>                 Key: HIVE-5102
>                 URL: https://issues.apache.org/jira/browse/HIVE-5102
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>         Attachments: HIVE-5102.D12579.1.patch, HIVE-5102.D12579.2.patch
>
>
> Currently ORC inherits getSplits from FileFormat, which basically makes a
> split per an HDFS block. This can create too little parallelism and would be
> better done by having getSplits look at the file footer and create splits
> based on the stripes.
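
Note on the ReaderImpl.java:276 comment above: in Java, byte is signed, so a byte holding a value between 0x80 and 0xff becomes a negative int when it is promoted. Masking with 0xff keeps the low eight bits and gives the unsigned value instead. The following is only a minimal sketch of that idiom, not the code from the patch; the class name UnsignedByteDemo and the helper toUnsigned are hypothetical names chosen for illustration.

```java
public class UnsignedByteDemo {
    // Hypothetical helper: widen a byte to int without sign extension.
    // The mask forces the upper 24 bits of the promoted int to zero.
    static int toUnsigned(byte b) {
        return b & 0xff;
    }

    public static void main(String[] args) {
        byte b = (byte) 0x80;

        // Natural promotion sign-extends, so this is negative.
        int promoted = b;

        // Masking recovers the value in the range 0..255.
        int masked = toUnsigned(b);

        System.out.println("promoted=" + promoted + " masked=" + masked);
    }
}
```

Bytes below 0x80 are unaffected by the mask, so it is safe to apply unconditionally when assembling a multi-byte length or offset from raw bytes.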