[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703993#action_12703993 ]
Zheng Shao commented on HIVE-352:
---------------------------------

The following numbers are all for a 128MB gzip-compressed block (for seqfile; about 20% smaller for rcfile because of the difference in compression ratio).

A. Read from seqfile + Write to seqfile: 2m 05s
B. Read from seqfile + Write to rcfile:  2m 45s
C. Read from rcfile  + Write to seqfile: 2m 20s
D. Read from rcfile  + Write to rcfile:  3m 00s

@Joydeep: The good compression ratio is mainly because we are compressing the column lengths and the column data (without delimiters) separately. In an earlier experiment I did, column-based compression only showed a 7-8% improvement because I was compressing column data together with the delimiters.

@Yongqiang: Did you turn on native compression when testing?

Some performance improvement tips from the profiling:

1. Change BytesRefArrayWritable to use a Java array (BytesRefWritable[]) instead of List<BytesRefWritable>.
2. Change RCFile$Writer.columnBuffers to use a Java array (ColumnBuffer[]) instead of List<ColumnBuffer>.
3. Add a method in BytesRefArrayWritable that returns the BytesRefWritable[] so that RCFile$Writer.append can operate on it directly.

Items 1-3 will save us 10-15 seconds on B and D.

4. RCFile$Writer$ColumnBuffer.append should directly call DataOutputStream.write and WritableUtils.writeVLong:

    public void append(BytesRefWritable data) throws IOException {
      data.writeDataTo(columnValBuffer);
      WritableUtils.writeVInt(valLenBuffer, data.getLength());
    }

Item 4 will save 5-10 seconds on B and D.

Following the same route, any List whose number of elements does not usually change should be replaced with a Java array ([]); a minimal sketch of this List-to-array refactoring follows at the end of this message.

Yongqiang, can you do steps 1-4 and try to replace List with Array?

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch, hive-352-2009-4-23.patch, hive-352-2009-4-27.patch, HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
>
> Column-based storage has been proven a better storage layout for OLAP. Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive to support column-based storage.
> Actually we have done some work on column-based storage on top of HDFS; I think it will need some review and refactoring to port it to Hive.
> Any thoughts?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
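For illustration, here is a minimal, self-contained sketch of the List-to-array refactoring suggested in tips 1-3. The class and member names (SimpleRefArray, refs, backingArray) are hypothetical simplifications, not the actual BytesRefArrayWritable or RCFile$Writer code; the real classes also implement Writable and manage serialized column buffers.

    // Hypothetical simplification of the List -> array idea from tips 1-3.
    // Not the real BytesRefArrayWritable; names here are illustrative only.
    import java.util.Arrays;

    public class SimpleRefArray {

      // Backing Java array instead of List<byte[]>; avoids Iterator
      // allocation and List.get indirection on the hot append/write path.
      private byte[][] refs;
      private int size;   // number of valid entries in refs

      public SimpleRefArray(int capacity) {
        refs = new byte[Math.max(capacity, 1)][];
        size = 0;
      }

      // Grow geometrically so append stays amortized O(1),
      // mirroring what a List would do internally.
      public void append(byte[] data) {
        if (size == refs.length) {
          refs = Arrays.copyOf(refs, refs.length * 2);
        }
        refs[size++] = data;
      }

      // Tip 3: expose the backing array so a writer can loop over it
      // directly instead of calling get(i) on a List for every element.
      public byte[][] backingArray() {
        return refs;
      }

      public int size() {
        return size;
      }

      // Tiny usage example: a writer-style loop over the raw array.
      public static void main(String[] args) {
        SimpleRefArray arr = new SimpleRefArray(4);
        arr.append("col1".getBytes());
        arr.append("col2".getBytes());

        byte[][] raw = arr.backingArray();
        for (int i = 0; i < arr.size(); i++) {
          System.out.println(new String(raw[i]));
        }
      }
    }

The design point is simply that a plain array lets the inner append/write loop index elements directly, avoiding the per-element overhead of List access, which is where tips 1-3 expect to recover the 10-15 seconds on cases B and D.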