[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12703993#action_12703993 ]

Zheng Shao commented on HIVE-352:
---------------------------------

The following numbers are all for a 128MB gzip-compressed block (for seqfile; 
about 20% smaller for rcfile because of its different compression ratio):
A. Read from seqfile + Write to seqfile: 2m 05s
B. Read from seqfile + Write to rcfile: 2m 45s
C. Read from rcfile + Write to seqfile: 2m 20s
D. Read from rcfile + Write to rcfile: 3m 00s

@Joydeep: The good compression ratio is mainly because we compress the column 
lengths and the column data (without delimiters) separately. In an earlier 
experiment, column-based compression showed only a 7-8% improvement because I 
was compressing the column data together with the delimiters.
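
To illustrate the idea (this is just a sketch of mine, not the RCFile code; the 
class and field names are made up): the raw value bytes go into one buffer and 
the vlong-encoded lengths into another, so each buffer compresses as a 
homogeneous stream with no delimiter bytes mixed in.

  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;

  import org.apache.hadoop.io.WritableUtils;

  // Sketch only: one buffer for undelimited value bytes, one for their
  // lengths. Each buffer would later be compressed on its own.
  class ColumnBufferSketch {
    private final ByteArrayOutputStream valData = new ByteArrayOutputStream();
    private final ByteArrayOutputStream lenData = new ByteArrayOutputStream();
    private final DataOutputStream valLen = new DataOutputStream(lenData);

    void append(byte[] value) throws IOException {
      valData.write(value, 0, value.length);          // raw column bytes, no delimiter
      WritableUtils.writeVLong(valLen, value.length); // length kept in a separate stream
    }
  }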

@Yongqiang: Did you turn on native compression when testing?

Some performance improvement tips from the profiling:
1. BytesRefArrayWritable to use a Java array (BytesRefWritable[]) instead of 
List<BytesRefWritable>
2. RCFile$Writer.columnBuffers to use a Java array (ColumnBuffer[]) instead of 
List<ColumnBuffer>
3. Add a method in BytesRefArrayWritable to return the BytesRefWritable[] so 
that RCFile$Writer.append can operate on it directly.
1-3 will save us 10-15 seconds on B and D (see the sketch after item 4 below).
4. RCFile$Writer$ColumnBuffer.append should directly call 
DataOutputStream.write and WritableUtils.writeVLong:
    public void append(BytesRefWritable data) throws IOException {
      data.writeDataTo(columnValBuffer);
      WritableUtils.writeVInt(valLenBuffer, data.getLength());
    }
4 will save 5-10 seconds from B and D.
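
A rough sketch of what 1-3 could look like (again just my illustration, not the 
patch; the method name backingArray and the import path are assumptions):

  import org.apache.hadoop.hive.serde2.columnar.BytesRefWritable;

  // Sketch of an array-backed BytesRefArrayWritable in the spirit of 1-3.
  class BytesRefArraySketch {
    private BytesRefWritable[] refs;   // plain Java array instead of a List
    private int valid;                 // number of slots actually in use

    BytesRefArraySketch(int capacity) {
      refs = new BytesRefWritable[capacity];
    }

    void set(int index, BytesRefWritable ref) {
      if (index >= refs.length) {      // grow only when a row gets wider
        BytesRefWritable[] bigger =
            new BytesRefWritable[Math.max(index + 1, refs.length * 2)];
        System.arraycopy(refs, 0, bigger, 0, refs.length);
        refs = bigger;
      }
      refs[index] = ref;
      valid = Math.max(valid, index + 1);
    }

    // Item 3: expose the backing array so that RCFile$Writer.append can loop
    // over it directly, without List.get() calls or iterator allocation.
    BytesRefWritable[] backingArray() {
      return refs;
    }

    int size() {
      return valid;
    }
  }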

Following the same route, wherever there is a List whose number of elements 
does not usually change, we should use a Java array ([]) instead of a List.

Yongqiang, can you do steps 1-4 and try to replace List with Array?


> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 
> progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, 
> hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, 
> hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch, 
> hive-352-2009-4-23.patch, hive-352-2009-4-27.patch, 
> HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
>
> Column-based storage has been proven to be a better storage layout for OLAP. 
> Hive does a great job on raw row-oriented storage. In this issue, we will 
> enhance Hive to support column-based storage. 
> Actually we have done some work on column-based storage on top of HDFS; I 
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
