[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091908#comment-14091908 ]

Lefty Leverenz commented on HIVE-4123:
--------------------------------------

Done, thanks [~prasanth_j]. Now the description for *hive.exec.orc.write.format* says:

{quote}
Define the version of the file to write. Possible values are 0.11 and 0.12. If this parameter is not defined, ORC will use the run length encoding (RLE) introduced in Hive 0.12. Any value other than 0.11 results in the 0.12 encoding.

Additional values may be introduced in the future (see HIVE-6002).
{quote}

HIVE-6586 (for HiveConf.java updates) has a comment about the new description.

* [Configuration Properties -- hive.exec.orc.write.format | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.orc.write.format]
* [HIVE-6586 comment about new description for hive.exec.orc.write.format | https://issues.apache.org/jira/browse/HIVE-6586?focusedCommentId=14091905&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14091905]

The RLE encoding for ORC can be improved

Key: HIVE-4123
URL: https://issues.apache.org/jira/browse/HIVE-4123
Project: Hive
Issue Type: New Feature
Components: File Formats
Affects Versions: 0.12.0
Reporter: Owen O'Malley
Assignee: Prasanth J
Labels: TODOC12, orcfile
Fix For: 0.12.0
Attachments: HIVE-4123-8.patch, HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, HIVE-4123.3.patch.txt, HIVE-4123.4.patch.txt, HIVE-4123.5.txt, HIVE-4123.6.txt, HIVE-4123.7.txt, HIVE-4123.8.txt, HIVE-4123.8.txt, HIVE-4123.patch.txt, ORC-Compression-Ratio-Comparison.xlsx

The run length encoding of integers can be improved:
* tighter bit packing
* allow delta encoding
* allow longer runs

--
This message was sent by Atlassian JIRA
(v6.2#6252)
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090350#comment-14090350 ]

Prasanth J commented on HIVE-4123:
----------------------------------

Please go ahead and update the original description. At this point the only valid values are 0.11 and 0.12. As you mentioned, if the parameter is not defined or is set to an invalid value, the default 0.12 encoding is used.

bq. Is that accurate? Can releases be specified as 0.12.0 or 0.13.1?

Yes. Accurate. HIVE-6002 was trying to add a patch number to the write version so that values such as 0.12.1 can be specified. But I don't think it will be committed until the next major change to the ORC writer.
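The rule confirmed above (only 0.11 selects the old encoding; an undefined or unrecognized value falls back to 0.12) can be sketched in a few lines of Java. This is an illustration of the described behavior only, not Hive's code: the class `OrcWriteFormatSketch`, the enum `RleVersion`, and the method `resolve` are hypothetical names introduced here.

```java
// Hypothetical sketch of the rule described in this thread; not Hive's implementation.
class OrcWriteFormatSketch {

    /** The two integer RLE generations discussed in this issue (names are made up). */
    enum RleVersion { V_0_11, V_0_12 }

    /**
     * Maps a raw hive.exec.orc.write.format value to an RLE version.
     * Only the exact string "0.11" selects the old encoding; null (parameter
     * not defined) and every other value select the 0.12 encoding.
     */
    static RleVersion resolve(String configured) {
        if ("0.11".equals(configured)) {
            return RleVersion.V_0_11;
        }
        return RleVersion.V_0_12;
    }

    public static void main(String[] args) {
        System.out.println(resolve(null));      // parameter not defined
        System.out.println(resolve("0.11"));    // explicit request for the old RLE
        System.out.println(resolve("0.12.1"));  // not a recognized value today
    }
}
```

Treating everything except 0.11 as 0.12 keeps the check trivial, but it also means a typo in the value is silently accepted, which is why the wiki description spells the rule out.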
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14089866#comment-14089866 ]

Lefty Leverenz commented on HIVE-4123:
--------------------------------------

Doc note: This added configuration parameter *hive.exec.orc.write.format* with a default value of 0.11, which was changed to null by HIVE-5091 before the release.

*hive.exec.orc.write.format* is documented in the wiki here:

* [Configuration Properties -- hive.exec.orc.write.format | https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-hive.exec.orc.write.format]
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14090193#comment-14090193 ]

Lefty Leverenz commented on HIVE-4123:
--------------------------------------

Doc questions: Would it be okay to restore part of the original description for *hive.exec.orc.write.format* in the wiki (and later in HiveConf.java)?

* current description is just "Define the version of the file to write" -- that doesn't give any idea about possible values, since the default is null, and it isn't clear that "version of the file" means Hive version
* original description was "use 0.11 version of RLE encoding. if this conf is not defined or any other value specified, ORC will use the new RLE encoding"

So I'd like to add "Possible values are 0.11, 0.12, etc. If this parameter is not defined, ORC will use the RLE encoding introduced in Hive 0.12. Any value other than 0.11 results in the 0.12 encoding."

Is that accurate? Can releases be specified as 0.12.0 or 0.13.1?
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737891#comment-13737891 ]

Hudson commented on HIVE-4123:
------------------------------

ABORTED: Integrated in Hive-trunk-hadoop2 #354 (See [https://builds.apache.org/job/Hive-trunk-hadoop2/354/])
HIVE-4123 Improved ORC integer RLE version 2. (Prasanth Jayachandran via omalley) (omalley: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1513155)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/gen/protobuf/gen-java/org/apache/hadoop/hive/ql/io/orc/OrcProto.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/IntegerReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/IntegerWriter.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java.orig
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReaderV2.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerWriter.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerWriterV2.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/SerializationUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java
* /hive/trunk/ql/src/protobuf/org/apache/hadoop/hive/ql/io/orc/orc_proto.proto
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestBitPack.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestIntegerCompressionReader.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestNewIntegerEncoding.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcNullOptimization.java
* /hive/trunk/ql/src/test/resources/orc-file-dump-dictionary-threshold.out
* /hive/trunk/ql/src/test/resources/orc-file-dump.out

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736773#comment-13736773 ]

Hive QA commented on HIVE-4123:
-------------------------------

{color:green}Overall{color}: +1 all checks pass

Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12597402/HIVE-4123.patch.txt

{color:green}SUCCESS:{color} +1 2848 tests passed

Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/400/testReport
Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/400/console

Messages:
{noformat}
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
{noformat}

This message is automatically generated.
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736931#comment-13736931 ]

Brock Noland commented on HIVE-4123:
------------------------------------

[~owen.omalley] looks like your comment was accidentally put in the Release Notes section.
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737060#comment-13737060 ]

Prasanth J commented on HIVE-4123:
----------------------------------

Thanks [~owen.omalley] for committing the patch!
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737282#comment-13737282 ]

Hudson commented on HIVE-4123:
------------------------------

SUCCESS: Integrated in Hive-trunk-hadoop1-ptest #124 (See [https://builds.apache.org/job/Hive-trunk-hadoop1-ptest/124/])
HIVE-4123 Improved ORC integer RLE version 2. (Prasanth Jayachandran via omalley) (omalley: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1513155)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/gen/protobuf/gen-java/org/apache/hadoop/hive/ql/io/orc/OrcProto.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/IntegerReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/IntegerWriter.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java.orig
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReaderV2.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerWriter.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerWriterV2.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/SerializationUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java
* /hive/trunk/ql/src/protobuf/org/apache/hadoop/hive/ql/io/orc/orc_proto.proto
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestBitPack.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestIntegerCompressionReader.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestNewIntegerEncoding.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcNullOptimization.java
* /hive/trunk/ql/src/test/resources/orc-file-dump-dictionary-threshold.out
* /hive/trunk/ql/src/test/resources/orc-file-dump.out
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737398#comment-13737398 ]

Hudson commented on HIVE-4123:
------------------------------

FAILURE: Integrated in Hive-trunk-hadoop2-ptest #55 (See [https://builds.apache.org/job/Hive-trunk-hadoop2-ptest/55/])
HIVE-4123 Improved ORC integer RLE version 2. (Prasanth Jayachandran via omalley) (omalley: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1513155)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/gen/protobuf/gen-java/org/apache/hadoop/hive/ql/io/orc/OrcProto.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/IntegerReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/IntegerWriter.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java.orig
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReaderV2.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerWriter.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerWriterV2.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/SerializationUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java
* /hive/trunk/ql/src/protobuf/org/apache/hadoop/hive/ql/io/orc/orc_proto.proto
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestBitPack.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestIntegerCompressionReader.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestNewIntegerEncoding.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcNullOptimization.java
* /hive/trunk/ql/src/test/resources/orc-file-dump-dictionary-threshold.out
* /hive/trunk/ql/src/test/resources/orc-file-dump.out
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13737695#comment-13737695 ]

Hudson commented on HIVE-4123:
------------------------------

SUCCESS: Integrated in Hive-trunk-h0.21 #2263 (See [https://builds.apache.org/job/Hive-trunk-h0.21/2263/])
HIVE-4123 Improved ORC integer RLE version 2. (Prasanth Jayachandran via omalley) (omalley: http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1513155)
* /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
* /hive/trunk/ql/src/gen/protobuf/gen-java/org/apache/hadoop/hive/ql/io/orc/OrcProto.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/IntegerReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/IntegerWriter.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RecordReaderImpl.java.orig
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReader.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerReaderV2.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerWriter.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/RunLengthIntegerWriterV2.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/SerializationUtils.java
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java
* /hive/trunk/ql/src/protobuf/org/apache/hadoop/hive/ql/io/orc/orc_proto.proto
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestBitPack.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestIntegerCompressionReader.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestNewIntegerEncoding.java
* /hive/trunk/ql/src/test/org/apache/hadoop/hive/ql/io/orc/TestOrcNullOptimization.java
* /hive/trunk/ql/src/test/resources/orc-file-dump-dictionary-threshold.out
* /hive/trunk/ql/src/test/resources/orc-file-dump.out
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736097#comment-13736097 ]

Owen O'Malley commented on HIVE-4123:
-------------------------------------

+1, it looks good to me.
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13735414#comment-13735414 ]

Owen O'Malley commented on HIVE-4123:
-------------------------------------

Thanks, Prasanth! This is looking good. I can't find any callers for WriterImpl.getWriteFormat. Is that dead code?
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13735473#comment-13735473 ]

Prasanth J commented on HIVE-4123:
----------------------------------

Yeah. It's not used anywhere. Sorry, I forgot to remove it. I removed that method in this new patch.
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13734231#comment-13734231 ]

Prasanth J commented on HIVE-4123:
----------------------------------

Thanks for the review, Owen. I have addressed the following issues with this patch:

- Date type handled for the new encoding
- Better encoding check added by overriding checkEncoding() for valid types
- Created factories for reader and writer creation
- Indentation fix
- DIRECT_V2 encoding can be turned on/off with the hive.exec.orc.write.format configuration parameter. If this parameter's value is 0.11, the old RLE encoding is used; if it is undefined or has any other value, the new RLE encoding is used.

Also, the HIVE-4324 patch is affected by this patch, so this new patch is generated on top of HIVE-4324.
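The factory idea in the list above might look roughly like the following. This is a simplified, hypothetical sketch: `IntWriter`, `OldRleWriter`, `NewRleWriter`, and `createWriter` are stand-in names invented for illustration, the writers only buffer values instead of encoding them, and none of this is taken from the patch itself.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of "one factory decides which integer writer to build".
class IntegerWriterFactorySketch {

    /** Stand-in for an integer stream writer; a real interface would be richer. */
    interface IntWriter {
        void write(long value);
        String encodingName();
    }

    /** Shared buffering so the sketch stays short; no real encoding happens here. */
    abstract static class BufferingWriter implements IntWriter {
        final List<Long> buffered = new ArrayList<>();

        @Override
        public void write(long value) {
            buffered.add(value);
        }
    }

    /** Placeholder for the original RLE writer. */
    static final class OldRleWriter extends BufferingWriter {
        @Override
        public String encodingName() {
            return "DIRECT";
        }
    }

    /** Placeholder for the new RLE writer. */
    static final class NewRleWriter extends BufferingWriter {
        @Override
        public String encodingName() {
            return "DIRECT_V2";
        }
    }

    /**
     * Single place that picks the writer: "0.11" selects the old RLE,
     * undefined (null) or any other value selects the new one.
     */
    static IntWriter createWriter(String writeFormat) {
        return "0.11".equals(writeFormat) ? new OldRleWriter() : new NewRleWriter();
    }
}
```

Keeping the choice in one factory method means callers never test the configuration value themselves, so adding a further format later would touch a single place.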
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731794#comment-13731794 ]

Prasanth J commented on HIVE-4123:
----------------------------------

Improved and fixed code comments, removed some redundant code, and made long repeat runs use DELTA encoding directly instead of calling the determineEncoding() function. A few more changes were added as well.
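One way to read the repeat-run shortcut: a run of identical values is a delta sequence whose delta is zero, so it can be described by a base value, a delta, and a length without any further analysis. The toy sketch below shows only that idea. `FixedDeltaSketch`, `Run`, `tryEncode`, and `decode` are hypothetical names, and the in-memory triple is not the ORC wire format or the logic of the patch.

```java
// Toy illustration of fixed-delta runs; not the ORC on-disk format.
class FixedDeltaSketch {

    /** A run described by its first value, a constant step, and a length. */
    static final class Run {
        final long base;
        final long delta;
        final int length;

        Run(long base, long delta, int length) {
            this.base = base;
            this.delta = delta;
            this.length = length;
        }
    }

    /**
     * Returns a Run if every consecutive difference is the same, else null.
     * A repeat run such as {7, 7, 7, 7} yields delta == 0.
     * Overflow of the subtraction is ignored in this sketch.
     */
    static Run tryEncode(long[] values) {
        if (values.length == 0) {
            return null;
        }
        long delta = values.length > 1 ? values[1] - values[0] : 0;
        for (int i = 2; i < values.length; i++) {
            if (values[i] - values[i - 1] != delta) {
                return null;
            }
        }
        return new Run(values[0], delta, values.length);
    }

    /** Expands a Run back into the original values. */
    static long[] decode(Run run) {
        long[] out = new long[run.length];
        long value = run.base;
        for (int i = 0; i < run.length; i++) {
            out[i] = value;
            value += run.delta;
        }
        return out;
    }
}
```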
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732274#comment-13732274 ]

Eric Hanson commented on HIVE-4123:
-----------------------------------

This is a great addition. Are you going to update the vectorized reader as well to read the updated format?
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732578#comment-13732578 ]

Prasanth J commented on HIVE-4123:
----------------------------------

[~ehans] Sure. I can take a look at the changes required for the vectorized reader to read these new encodings.
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13732806#comment-13732806 ]

Prasanth J commented on HIVE-4123:
----------------------------------

Updated the Excel sheet. It compares the existing RLE (baseline) with the new RLE. The latest patch after code review shows a better compression ratio than both the old patch and the existing RLE.

I have also added the encoding and decoding times to the Excel sheet. Those times are not very reliable, since they were measured over only 1 iteration. I also ran encoding/decoding over a 25M-row file for 5 iterations and took the average of the last 3 iterations. HIVE-4123.2.git.patch.txt took 2072ms on average to encode the 25M-row file and 920ms to decode the encoded file. HIVE-4123.6.txt, on the other hand, took 1374ms on average to encode the 25M-row file and 874ms to decode the encoded file.
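The measurement procedure described above (run several iterations, discard the early ones as warm-up, average the rest) can be sketched as follows. This is a minimal illustration, not the harness that produced the numbers in the comment: `WarmupTimingSketch` and its methods are hypothetical names, and a serious benchmark would normally use a dedicated tool.

```java
// Hypothetical timing helper: average only the last few iterations.
class WarmupTimingSketch {

    /** Runs the task the given number of times and records each duration in ms. */
    static long[] timeRuns(Runnable task, int iterations) {
        long[] millis = new long[iterations];
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            task.run();
            millis[i] = (System.nanoTime() - start) / 1_000_000;
        }
        return millis;
    }

    /** Averages the last 'keep' timings, treating the earlier ones as warm-up. */
    static double averageOfLast(long[] timings, int keep) {
        if (keep <= 0 || keep > timings.length) {
            throw new IllegalArgumentException("keep must be between 1 and " + timings.length);
        }
        long sum = 0;
        for (int i = timings.length - keep; i < timings.length; i++) {
            sum += timings[i];
        }
        return (double) sum / keep;
    }

    /** Relative reduction in percent when going from 'before' to 'after'. */
    static double percentReduction(double before, double after) {
        return 100.0 * (before - after) / before;
    }
}
```

Applied to the averages reported in the comment, `percentReduction(2072, 1374)` is about 33.7 for encoding and `percentReduction(920, 874)` is 5.0 for decoding, so most of the gain between the two patches is on the encoding side.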
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733115#comment-13733115 ] Owen O'Malley commented on HIVE-4123: - This is looking good, Prasanth. A couple more comments:
* You need to handle the date type.
* You should update checkEncoding to accept only the encodings that are appropriate for each type (direct for binary, boolean, struct, and byte; direct_v2, dictionary, or dictionary_v2 for string; and direct or direct_v2 for most of the rest).
* You should probably make a factory for creating the integer reader so that the code lives in only one place.
* The formatting in some of the new classes seems to use 8 spaces for indentation.
The RLE encoding for ORC can be improved Key: HIVE-4123 URL: https://issues.apache.org/jira/browse/HIVE-4123 Project: Hive Issue Type: New Feature Components: File Formats Affects Versions: 0.12.0 Reporter: Owen O'Malley Assignee: Prasanth J Labels: orcfile Fix For: 0.12.0 Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, HIVE-4123.3.patch.txt, HIVE-4123.4.patch.txt, HIVE-4123.5.txt, HIVE-4123.6.txt, ORC-Compression-Ratio-Comparison.xlsx The run length encoding of integers can be improved: * tighter bit packing * allow delta encoding * allow longer runs
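The per-type encoding check requested in the comment above could look roughly like the following. This is a minimal sketch, not Hive's actual checkEncoding: the class, the two enums, and the `allowed` helper are hypothetical names, and treating every remaining type (including date) as "direct or direct_v2" is an assumption based on the phrase "most of the rest".

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical sketch of a per-type encoding check; not Hive's real classes. */
final class EncodingCheckSketch {
  enum Encoding { DIRECT, DIRECT_V2, DICTIONARY, DICTIONARY_V2 }

  /** Illustrative subset of column kinds; the real type list is larger. */
  enum Kind { BINARY, BOOLEAN, STRUCT, BYTE, STRING, INT, LONG, DATE }

  /** Encodings accepted for a column kind, following the list in the review comment. */
  static Set<Encoding> allowed(Kind kind) {
    switch (kind) {
      case BINARY:
      case BOOLEAN:
      case STRUCT:
      case BYTE:
        return EnumSet.of(Encoding.DIRECT);
      case STRING:
        return EnumSet.of(Encoding.DIRECT_V2, Encoding.DICTIONARY, Encoding.DICTIONARY_V2);
      default:
        // Assumption: everything else takes either integer encoding version.
        return EnumSet.of(Encoding.DIRECT, Encoding.DIRECT_V2);
    }
  }

  /** Rejects an encoding that is not appropriate for the column kind. */
  static void checkEncoding(Kind kind, Encoding encoding) {
    if (!allowed(kind).contains(encoding)) {
      throw new IllegalArgumentException(
          "Unsupported encoding " + encoding + " for column kind " + kind);
    }
  }
}
```

Keeping the table in one method means a reader for any type can call the same check from its startStripe without repeating the rules.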
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13731431#comment-13731431 ] Owen O'Malley commented on HIVE-4123: - * Please remove the FIXME comment * Use the encoding for the column that is passed into startStripe. The RLE encoding for ORC can be improved Key: HIVE-4123 URL: https://issues.apache.org/jira/browse/HIVE-4123 Project: Hive Issue Type: New Feature Components: File Formats Affects Versions: 0.12.0 Reporter: Owen O'Malley Assignee: Prasanth J Labels: orcfile Fix For: 0.12.0 Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, HIVE-4123.3.patch.txt, HIVE-4123.4.patch.txt, ORC-Compression-Ratio-Comparison.xlsx The run length encoding of integers can be improved: * tighter bit packing * allow delta encoding * allow longer runs -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731580#comment-13731580 ] Prasanth J commented on HIVE-4123: -- The following fixes were added to this patch:
- Removed FIXMEs.
- Added a new encoding type, DICTIONARY_V2, to identify which integer encoding (DIRECT/DIRECT_V2) a dictionary uses. DICTIONARY_V2 uses DIRECT_V2 encoding for the dictionary data and length streams. In the earlier patch there was no way to determine whether dictionaries used DIRECT or DIRECT_V2 encoding; this patch addresses that issue. I am not sure whether there is any other way to determine this without adding a new encoding type.
- Addressed the code review comment about using if/then/else in the flush() method of RunLengthIntegerWriterV2.
The RLE encoding for ORC can be improved Key: HIVE-4123 URL: https://issues.apache.org/jira/browse/HIVE-4123 Project: Hive Issue Type: New Feature Components: File Formats Affects Versions: 0.12.0 Reporter: Owen O'Malley Assignee: Prasanth J Labels: orcfile Fix For: 0.12.0 Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, HIVE-4123.3.patch.txt, HIVE-4123.4.patch.txt, HIVE-4123.5.txt, ORC-Compression-Ratio-Comparison.xlsx The run length encoding of integers can be improved: * tighter bit packing * allow delta encoding * allow longer runs
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13729816#comment-13729816 ] Hive QA commented on HIVE-4123: --- {color:red}Overall{color}: -1 no tests executed Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12594072/HIVE-4123.4.patch.txt Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/308/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/308/console Messages: {noformat} Executing org.apache.hive.ptest.execution.PrepPhase Tests failed with: NonZeroExitCodeException: Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n '' ]] + export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128' + ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128' + cd /data/hive-ptest/working/ + tee /data/hive-ptest/logs/PreCommit-HIVE-Build-308/source-prep.txt + mkdir -p maven ivy + [[ svn = \s\v\n ]] + [[ -n '' ]] + [[ -d apache-svn-trunk-source ]] + [[ ! -d apache-svn-trunk-source/.svn ]] + [[ ! -d apache-svn-trunk-source ]] + cd apache-svn-trunk-source + svn revert -R . 
Reverted 'ant/src/org/apache/hadoop/hive/ant/antlib.xml' Reverted 'hbase-handler/ivy.xml' Reverted 'hbase-handler/src/test/org/apache/hadoop/hive/hbase/HBaseTestSetup.java' Reverted 'hbase-handler/src/test/org/apache/hadoop/hive/hbase/TestHBaseSerDe.java' Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat.java' Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/HiveHBaseTableOutputFormat.java' Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseSerDe.java' Reverted 'hbase-handler/src/java/org/apache/hadoop/hive/hbase/HBaseStorageHandler.java' Reverted 'build.xml' Reverted 'ivy/libraries.properties' Reverted 'hcatalog/core/build.xml' Reverted 'hcatalog/pom.xml' Reverted 'hcatalog/build.properties' Reverted 'hcatalog/build.xml' Reverted 'hcatalog/storage-handlers/hbase/src/test/org/apache/hcatalog/hbase/snapshot/TestRevisionManager.java' Reverted 'hcatalog/storage-handlers/hbase/src/test/org/apache/hcatalog/hbase/snapshot/TestRevisionManagerEndpoint.java' Reverted 'hcatalog/storage-handlers/hbase/src/test/org/apache/hcatalog/hbase/ManyMiniCluster.java' Reverted 'hcatalog/storage-handlers/hbase/src/test/org/apache/hcatalog/hbase/TestHBaseDirectOutputFormat.java' Reverted 'hcatalog/storage-handlers/hbase/src/test/org/apache/hcatalog/hbase/TestHBaseBulkOutputFormat.java' Reverted 'hcatalog/storage-handlers/hbase/src/test/org/apache/hcatalog/hbase/TestHBaseInputFormat.java' Reverted 'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/snapshot/TableSnapshot.java' Reverted 'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/snapshot/RevisionManagerProtocol.java' Reverted 'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/snapshot/Transaction.java' Reverted 'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/snapshot/RevisionManager.java' Reverted 
'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/snapshot/RevisionManagerEndpointClient.java' Reverted 'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/snapshot/RevisionManagerEndpoint.java' Reverted 'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/snapshot/ZKBasedRevisionManager.java' Reverted 'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/ImportSequenceFile.java' Reverted 'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/HbaseSnapshotRecordReader.java' Reverted 'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/HBaseHCatStorageHandler.java' Reverted 'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/HBaseBaseOutputFormat.java' Reverted 'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/HBaseDirectOutputFormat.java' Reverted 'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/HBaseBulkOutputFormat.java' Reverted 'hcatalog/storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/HBaseInputFormat.java' Reverted 'hcatalog/storage-handlers/hbase/pom.xml' Reverted 'hcatalog/build-support/ant/build-common.xml' Reverted 'hcatalog/build-support/ant/deploy.xml' Reverted 'hcatalog/build-support/ant/checkstyle.xml' Reverted 'hcatalog/hcatalog-pig-adapter/src/test/java/org/apache/hcatalog/pig/TestE2EScenarios.java' Reverted 'build-common.xml' Reverted '.gitignore' Reverted 'ql/ivy.xml' ++ awk '{print $2}' ++ egrep -v '^X|^Performing status on external' ++ svn status --no-ignore + rm -rf build ant/src/org/apache/hadoop/hive/ant/SetSystemProperty.java hbase-handler/src/java/org/apache/hadoop/hive/hbase/PutWritable.java
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13718825#comment-13718825 ] Prasanth J commented on HIVE-4123: -- {quote}Comments: merge Utils into SerializationUtils. use the zigzag encode/decode in the the SerializationUtils.read/writeVslong move Utils.nextLong to the test code Utils.getTotalBytesRequired should just use long math. (n * numBits + 7) / 8 should work Rename IntegerCompressionReader/Writer to RunLengthIntegerReader/WriterV2 {quote} Done. {quote} Create an interface IntegerReader that has: seek next skip {quote} Added hasNext() to interface as well. {quote} Make RunLengthIntegerReader and RunLengthIntegerReaderV2 implement IntegerReader The TreeReaders should declare the fields as IntegerReader. Each of the startStripe should use the encoding to create the right implementation of IntegerReader. We should do the same with an IntegerWriter interface. Replace fixedBitSizes with static methods in SerializationUtils: static int encodeBitWidth(int n) static int decodeBitWidth(int n) {quote} Done. {quote} Finding the percentiles seems expensive, we should look at an alternative {quote} Done. {quote} Why is the delta blob zigzag encoded? The sign should always be positive or negative for the entire run. {quote} Made the delta base field mandatory, blob is now directly bit packed. {quote} Maybe we could create an enum in the Writer that is the version to write that would look like enum OrcVersion { V0_11, V0_12 } and the StreamFactory could provide the version to the TreeWriters. {quote} Not done (as per your last comment about passing factory object) {quote} I don't see why bitpack reader/writer are more than static methods that read/write to the underlying stream. So I would have expected a method like writeInts(long[] data, int offset, int length, int numBits, OutputStream stream) and the corresponding one for reading. {quote} Added as a separate static method. 
Can we reuse BitFieldReader/BitFieldWriter which essentially does the same thing (except it deals with ints)? {quote} Utils.bytesToLongBE should take an input stream rather than a byte[]. {quote} Done. {quote} In IntegerCompressionReader: I'd write a method to translate the int into an opcode rather than use ordinal. It is probably worth remembering that you are in a repeat, so that you don't need to copy the value N times in short repeat. {quote} Done. {quote} It may be easier to loop through the base values and then run through the patches. You might even do three loops: unpack the main values, unpack the patches, add the base to each value. {quote} My initial implementation was running through 3 loops. But later I refactored it to do in a single loop. I think this current patch removed some complexity (removed zigzag and changed bitpacking). {quote} For patched based only the base is zigzag encoded. The rest of the values are always positive. For delta only the base and base delta are zigzag encoded. {quote} Good catch. Updated the patch. {quote} In IntegerCompressionWriter: You should give more comments about the patched base encoding. Instead of sorting for the percentiles, you could keep a count of how many values use each number of bits. {quote} Done. Nice idea! {quote} Replace the commented out printlns with LOG.debug surrounded by LOG.ifDebugEnabled flush should use if/then/else to prevent writing the data twice the constructor should probably call clear rather than risk having the default values be different in write, just copy the data with system.arraycopy instead of cloning the array {quote} Done. {quote} write should track whether the values are monotonically increasing or decreasing so that we know if delta applies there is a lot of duplication of effort in determine encoding {quote} write primarily deals with cutting the runs (determining the scope). There was some redundancy that I removed in the current patch. 
Also, the min/max tracking was wrong in the earlier patch and is fixed in the new one. Earlier, min/max were updated as each value was buffered, which led to wrong output in some cases. For example, the sequence 2 3 4 5 6 1 1 1 has a min value of 1, but that 1 belongs to the short repeat run; the same min value was nevertheless used for the initial delta run as well. Min/max, monotonicity, delta computation and percentiles are now determined while iterating through the buffered values. {quote} if the sequence is both increasing and decreasing, it is constant and we should either use short literal or delta depending on the length delta encoding should return before doing the percentile work {quote} Currently, delta encoding returns before percentile computation. Short repeats are determined when buffering values. All other encodings are determined in determineEncoding(). {quote} How much unit test coverage do you have of the new code? {quote} I have unit
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13718830#comment-13718830 ] Prasanth J commented on HIVE-4123: -- Just noticed. Please ignore the formatting changes that slipped through in SerializationUtils.java. I will fix that in next version of patch. The RLE encoding for ORC can be improved Key: HIVE-4123 URL: https://issues.apache.org/jira/browse/HIVE-4123 Project: Hive Issue Type: New Feature Components: File Formats Reporter: Owen O'Malley Assignee: Prasanth J Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, HIVE-4123.3.patch.txt, ORC-Compression-Ratio-Comparison.xlsx The run length encoding of integers can be improved: * tighter bit packing * allow delta encoding * allow longer runs -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13719083#comment-13719083 ] Prasanth J commented on HIVE-4123: -- Updated the patch with a bug fix in the patched base encoding. The formatting changes are fixed in this patch. Added more test cases for patched base encoding that cover more edge cases. Also, the changes to TestFileDump have been removed, since the memory manager chooses the stripe size based on available JVM memory, which I vary for different test cases. The RLE encoding for ORC can be improved Key: HIVE-4123 URL: https://issues.apache.org/jira/browse/HIVE-4123 Project: Hive Issue Type: New Feature Components: File Formats Reporter: Owen O'Malley Assignee: Prasanth J Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, HIVE-4123.3.patch.txt, HIVE-4123.4.patch.txt, ORC-Compression-Ratio-Comparison.xlsx The run length encoding of integers can be improved: * tighter bit packing * allow delta encoding * allow longer runs
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13717596#comment-13717596 ] Owen O'Malley commented on HIVE-4123: - {quote} 1) In the current implementation, I kept the delta base field as optional (used only for fixed delta runs) and zigzag encoded the delta blob so that we don't have to deal with sign of the deltas. I can change delta base field to mandatory field to store the base (absolute min) value of delta values and zigzag encode it. With base value and delta base value, we should be able to identify if the sequence is monotonically increasing or decreasing and also we can identify the sign of the delta values. I hope this is what you are looking for. Please correct me if my understanding is wrong. {quote} I think it will be worthwhile always having the delta base and keeping the additional delta as an unsigned remainder. {quote} 2) is there any way we can reuse the Orc's MAJOR and MINOR version as supported in HIVE-4724 to figure out if we need use new integer encoding or old integer encoding? {quote} Yeah, I need to add more framework for that code. I'm leaning toward passing in a factory object that creates the right integer encoder. The RLE encoding for ORC can be improved Key: HIVE-4123 URL: https://issues.apache.org/jira/browse/HIVE-4123 Project: Hive Issue Type: New Feature Components: File Formats Reporter: Owen O'Malley Assignee: Prasanth J Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, ORC-Compression-Ratio-Comparison.xlsx The run length encoding of integers can be improved: * tighter bit packing * allow delta encoding * allow longer runs -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
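One way to read the suggestion above (always store the delta base, keep the remaining deltas unsigned) is sketched below. It is an illustration only, not the patch's code: the class and method names are hypothetical, the output is a plain long[] rather than a bit-packed stream, and the sketch assumes that the sign of the first delta gives the direction of the whole run.

```java
/** Hypothetical sketch: signed first delta plus unsigned remaining deltas. */
final class DeltaRunSketch {

  /**
   * Returns {base, signed first delta, |delta 2|, |delta 3|, ...}.
   * Rejects runs that are not monotonic, since a single sign could not describe them.
   * Overflow of very large differences is ignored in this sketch.
   */
  static long[] encode(long[] values) {
    if (values.length < 2) {
      throw new IllegalArgumentException("a delta run needs at least two values");
    }
    long[] out = new long[values.length];
    long first = values[1] - values[0];
    out[0] = values[0];
    out[1] = first; // the only signed delta; its sign gives the direction
    for (int i = 2; i < values.length; i++) {
      long d = values[i] - values[i - 1];
      if (d != 0 && (first == 0 || (d < 0) != (first < 0))) {
        throw new IllegalArgumentException("sequence is not a monotonic delta run");
      }
      out[i] = Math.abs(d); // magnitude only, so no zigzag is needed
    }
    return out;
  }

  /** Rebuilds the original values from the output of encode. */
  static long[] decode(long[] encoded) {
    long[] values = new long[encoded.length];
    values[0] = encoded[0];
    values[1] = encoded[0] + encoded[1];
    boolean decreasing = encoded[1] < 0;
    for (int i = 2; i < encoded.length; i++) {
      values[i] = decreasing ? values[i - 1] - encoded[i] : values[i - 1] + encoded[i];
    }
    return values;
  }
}
```

Because only non-negative magnitudes remain after the first delta, they can be bit packed directly at the width of the largest magnitude.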
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714795#comment-13714795 ] Owen O'Malley commented on HIVE-4123: - More comments:
* I don't see why the bitpack reader/writer are more than static methods that read/write to the underlying stream. I would have expected a method like writeInts(long[] data, int offset, int length, int numBits, OutputStream stream) and the corresponding one for reading.
* Utils.bytesToLongBE should take an input stream rather than a byte[].
* In IntegerCompressionReader:
** I'd write a method to translate the int into an opcode rather than use ordinal.
** It is probably worth remembering that you are in a repeat, so that you don't need to copy the value N times in short repeat.
** It may be easier to loop through the base values and then run through the patches. You might even do three loops: unpack the main values, unpack the patches, add the base to each value.
** For patched base, only the base is zigzag encoded. The rest of the values are always positive.
** For delta, only the base and base delta are zigzag encoded.
* In IntegerCompressionWriter:
** You should give more comments about the patched base encoding.
** Instead of sorting for the percentiles, you could keep a count of how many values use each number of bits.
** Replace the commented-out printlns with LOG.debug surrounded by LOG.isDebugEnabled.
** flush should use if/then/else to prevent writing the data twice.
** The constructor should probably call clear rather than risk having the default values be different.
** In write, just copy the data with System.arraycopy instead of cloning the array.
** write should track whether the values are monotonically increasing or decreasing so that we know if delta applies.
** There is a lot of duplication of effort in determine encoding.
** If the sequence is both increasing and decreasing, it is constant and we should either use short literal or delta depending on the length.
** Delta encoding should return before doing the percentile work.
* How much unit test coverage do you have of the new code?
* Have you run the encoder/decoder round trip over the github data to test it?
The RLE encoding for ORC can be improved Key: HIVE-4123 URL: https://issues.apache.org/jira/browse/HIVE-4123 Project: Hive Issue Type: New Feature Components: File Formats Reporter: Owen O'Malley Assignee: Prasanth J Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, ORC-Compression-Ratio-Comparison.xlsx The run length encoding of integers can be improved: * tighter bit packing * allow delta encoding * allow longer runs
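The suggestion to count how many values use each number of bits, instead of sorting to find percentiles, can be sketched as follows. This is a rough illustration under stated assumptions, not the patch's implementation: the class and method names are hypothetical, values are assumed to be non-negative (already reduced by the base), and the percentile is passed as a whole-number percentage to keep the arithmetic in integers.

```java
/** Hypothetical sketch: percentile bit width from a histogram of bit widths. */
final class BitWidthHistogramSketch {

  /** Bits needed to store v as an unsigned value; zero still takes one bit here. */
  static int bitsRequired(long v) {
    return v == 0 ? 1 : 64 - Long.numberOfLeadingZeros(v);
  }

  /**
   * Smallest bit width that covers at least the given percentage of the values.
   * One pass fills the histogram; a second pass over at most 64 buckets finds the width.
   */
  static int percentileBits(long[] values, int percent) {
    int[] histogram = new int[65]; // index = bit width, 1..64
    for (long v : values) {
      histogram[bitsRequired(v)]++;
    }
    // How many values may be wider than the chosen width (these would become patches).
    long remaining = (long) values.length * (100 - percent) / 100;
    for (int bits = 64; bits >= 1; bits--) {
      remaining -= histogram[bits];
      if (remaining < 0) {
        return bits;
      }
    }
    return 1; // empty input
  }
}
```

The cost is linear in the number of buffered values, whereas sorting the buffer for every run costs O(n log n).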
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714282#comment-13714282 ] Prasanth J commented on HIVE-4123: -- Thanks, Owen, for the review comments. There are a few things I want to confirm before submitting the next version of the patch. 1) In the current implementation, I kept the delta base field optional (used only for fixed delta runs) and zigzag encoded the delta blob so that we don't have to deal with the sign of the deltas. I can make the delta base field mandatory, so that it stores the base (absolute min) value of the delta values, and zigzag encode it. With the base value and the delta base value, we should be able to tell whether the sequence is monotonically increasing or decreasing, and so identify the sign of the delta values. I hope this is what you are looking for. Please correct me if my understanding is wrong. 2) Is there any way we can reuse ORC's MAJOR and MINOR version, as supported in HIVE-4724, to figure out whether we need to use the new integer encoding or the old one? The RLE encoding for ORC can be improved Key: HIVE-4123 URL: https://issues.apache.org/jira/browse/HIVE-4123 Project: Hive Issue Type: New Feature Components: File Formats Reporter: Owen O'Malley Assignee: Prasanth J Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, ORC-Compression-Ratio-Comparison.xlsx The run length encoding of integers can be improved: * tighter bit packing * allow delta encoding * allow longer runs
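For readers unfamiliar with the zigzag encoding mentioned above: it maps signed values to unsigned ones so that values of small magnitude, positive or negative, stay small (0, -1, 1, -2 become 0, 1, 2, 3). The sketch below shows the standard 64-bit form; the class name is hypothetical and this is not a copy of the SerializationUtils code.

```java
/** Standard 64-bit zigzag mapping; the class name is illustrative only. */
final class ZigZagSketch {

  /** Interleaves negative and positive values: 0, -1, 1, -2 become 0, 1, 2, 3. */
  static long encode(long n) {
    return (n << 1) ^ (n >> 63); // arithmetic shift spreads the sign bit
  }

  /** Inverse of encode. */
  static long decode(long z) {
    return (z >>> 1) ^ -(z & 1); // logical shift, then flip all bits if the low bit was set
  }
}
```

Since a monotonic run has deltas of a single sign, zigzag spends one bit per delta on information that is constant for the run, which is the point of the question about the delta blob.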
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13712916#comment-13712916 ] Owen O'Malley commented on HIVE-4123: - Comments:
* Merge Utils into SerializationUtils.
* Use the zigzag encode/decode in the SerializationUtils.read/writeVslong.
* Move Utils.nextLong to the test code.
* Utils.getTotalBytesRequired should just use long math. (n * numBits + 7) / 8 should work.
* Rename IntegerCompressionReader/Writer to RunLengthIntegerReader/WriterV2.
* Create an interface IntegerReader that has:
** seek
** next
** skip
* Make RunLengthIntegerReader and RunLengthIntegerReaderV2 implement IntegerReader.
* The TreeReaders should declare the fields as IntegerReader.
* Each startStripe should use the encoding to create the right implementation of IntegerReader.
* We should do the same with an IntegerWriter interface.
* Replace fixedBitSizes with static methods in SerializationUtils:
** static int encodeBitWidth(int n)
** static int decodeBitWidth(int n)
* Finding the percentiles seems expensive; we should look at an alternative.
* Why is the delta blob zigzag encoded? The sign should always be positive or negative for the entire run.
* Maybe we could create an enum in the Writer for the version to write, which would look like enum OrcVersion { V0_11, V0_12 }, and the StreamFactory could provide the version to the TreeWriters.
The RLE encoding for ORC can be improved Key: HIVE-4123 URL: https://issues.apache.org/jira/browse/HIVE-4123 Project: Hive Issue Type: New Feature Components: File Formats Reporter: Owen O'Malley Assignee: Prasanth J Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, ORC-Compression-Ratio-Comparison.xlsx The run length encoding of integers can be improved: * tighter bit packing * allow delta encoding * allow longer runs
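The formula (n * numBits + 7) / 8 from the comments above rounds a bit count up to whole bytes. The sketch below shows why doing it in long math matters; the class and method names are hypothetical, and the int variant exists only to demonstrate the overflow.

```java
/** Hypothetical sketch of the byte-count arithmetic for bit-packed values. */
final class PackedSizeSketch {

  /** Bytes needed to hold n values of numBits bits each, rounded up to a whole byte. */
  static long totalBytesRequired(long n, int numBits) {
    return (n * numBits + 7) / 8;
  }

  /** Same formula in int arithmetic; the product can overflow for large inputs. */
  static int totalBytesRequiredInt(int n, int numBits) {
    return (n * numBits + 7) / 8;
  }
}
```

For example, 3 values of 5 bits need 15 bits, which rounds up to 2 bytes. With 300,000,000 values of 8 bits the product no longer fits in an int, so the int variant returns a negative number while the long variant returns 300,000,000.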
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710342#comment-13710342 ] Prasanth J commented on HIVE-4123: -- This patch improves upon the existing run length encoding for integers. As mentioned in the description, it uses bit packing for tighter compression, improves run length and delta encoding, and supports longer runs. This patch supports the following lightweight compression techniques:
* *SHORT_REPEAT*
* *DIRECT*
* *PATCHED_BASE*
* *DELTA*
The description and format of each type are as follows.
*SHORT_REPEAT:* Used for short repeated integer sequences.
* 1 byte header
** 2 bits for encoding type
** 3 bits for bytes required for repeating value
** 3 bits for repeat count (MIN_REPEAT + run length)
* Blob - repeat value (fixed bytes)
*DIRECT:* Used for random integer sequences whose bit width requirement doesn't vary a lot.
* 2 bytes header
** 1st byte
*** 2 bits for encoding type
*** 5 bits for fixed bit width of values in blob
*** 1 bit for storing MSB of run length
** 2nd byte
*** 8 bits for lower run length bits
* Blob - fixed width * run length bits long
*PATCHED_BASE:* Used for random integer sequences whose bit width requirement varies beyond a threshold.
* 4 bytes header
** 1st byte
*** 2 bits for encoding type
*** 5 bits for fixed bit width of values in blob
*** 1 bit for storing MSB of run length
** 2nd byte
*** 8 bits for lower run length bits
** 3rd byte
*** 3 bits for bytes required for base value
*** 5 bits for patch width
** 4th byte
*** 3 bits for patch gap width
*** 5 bits for patch length
* Base value - base width * 8 bits
* Data blob - fixed width * run length
* Patch blob - (patch width + patch gap width) * patch length
*DELTA:* Used for monotonically increasing or decreasing sequences, sequences with fixed delta values, or long repeated sequences.
* 2 bytes header
** 1st byte
*** 2 bits for encoding type
*** 5 bits for fixed bit width of values in blob
*** 1 bit for storing MSB of run length
** 2nd byte
*** 8 bits for lower run length bits
* Base value - encoded as varint
* Delta base (only long fixed delta runs) - zigzag encoded
* Delta blob (variable delta runs) - zigzag encoded
I have compared the new implementation against the current one; the attached Excel sheet shows the compression ratios of both for various real-world datasets. As the comparison sheet shows, the new implementation gives a significant improvement in compression ratio over the existing implementation in most cases. NOTE: This patch is generated against trunk after applying the HIVE-4724 patch. [~owen.omalley] can you please review this patch and let me know your review comments? Also let me know if I need to upload this patch to phabricator.
The RLE encoding for ORC can be improved Key: HIVE-4123 URL: https://issues.apache.org/jira/browse/HIVE-4123 Project: Hive Issue Type: New Feature Components: File Formats Reporter: Owen O'Malley Assignee: Owen O'Malley The run length encoding of integers can be improved: * tighter bit packing * allow delta encoding * allow longer runs
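The SHORT_REPEAT layout described above (a one-byte header followed by the repeated value in a fixed number of bytes) can be sketched as below. Several details are assumptions that the comment does not spell out: MIN_REPEAT is taken to be 3, the encoding type tag for SHORT_REPEAT is taken to be 0, the width field stores the byte count minus one, the value is written big-endian, and only non-negative values are handled (no zigzag). The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;

/** Hypothetical sketch of packing a SHORT_REPEAT run; field conventions are assumed. */
final class ShortRepeatSketch {
  static final int MIN_REPEAT = 3;       // assumption: shortest run stored this way
  static final int SHORT_REPEAT_TAG = 0; // assumption: 2-bit encoding type value

  /** Encodes `count` repetitions of a non-negative `value` as header byte + value bytes. */
  static byte[] encode(long value, int count) {
    if (value < 0) {
      throw new IllegalArgumentException("sketch handles non-negative values only");
    }
    if (count < MIN_REPEAT || count > MIN_REPEAT + 7) {
      throw new IllegalArgumentException("repeat count does not fit in 3 bits");
    }
    // Bytes needed for the value, at least one so that zero can be stored.
    int numBytes = Math.max(1, (64 - Long.numberOfLeadingZeros(value) + 7) / 8);
    // 2 bits type | 3 bits (numBytes - 1) | 3 bits (count - MIN_REPEAT)
    int header = (SHORT_REPEAT_TAG << 6) | ((numBytes - 1) << 3) | (count - MIN_REPEAT);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    out.write(header);
    for (int i = numBytes - 1; i >= 0; i--) {
      out.write((int) (value >>> (8 * i)) & 0xff); // most significant byte first
    }
    return out.toByteArray();
  }
}
```

Under these assumptions, five repetitions of 10000 take three bytes in total: one header byte plus the two bytes of the value.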
[jira] [Commented] (HIVE-4123) The RLE encoding for ORC can be improved
[ https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710448#comment-13710448 ] Prasanth J commented on HIVE-4123: -- The earlier patch included a .orig file generated while applying the HIVE-4724 patch. This new patch removes the .orig file. The RLE encoding for ORC can be improved Key: HIVE-4123 URL: https://issues.apache.org/jira/browse/HIVE-4123 Project: Hive Issue Type: New Feature Components: File Formats Reporter: Owen O'Malley Assignee: Prasanth J Attachments: HIVE-4123.1.git.patch.txt, HIVE-4123.2.git.patch.txt, ORC-Compression-Ratio-Comparison.xlsx The run length encoding of integers can be improved: * tighter bit packing * allow delta encoding * allow longer runs