[ 
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701839#action_12701839
 ] 

Zheng Shao commented on HIVE-352:
---------------------------------

@Yongqiang: I found a place in the SequenceFile reader test where we may be 
able to improve the performance a lot: BytesRefWritable.readFields creates a 
new array for each row. That per-row allocation is bad, and I would say it 
makes the comparison between RCFile and SequenceFile unfair.

There are 3 ways to fix BytesRefWritable:
1. Add a boolean member "owned". Set it to true every time we create an array 
in readFields, and don't create another array if owned is true and the current 
record is no larger than the owned array. Also, set it to false every time 
set(...) is called.
2. Directly change the semantics of readFields: always reuse the bytes array 
if its length is greater than or equal to that of the current record, and 
otherwise create a new one. This is OK because people who use set(...) 
probably won't use readFields at all. Of course, we need to put a comment on 
readFields and set() saying that readFields will corrupt the array, so callers 
of set() should not call readFields.
3. Use a completely different class hierarchy.

I would prefer to do 2 since it's the simplest way to go.
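
To make option 2 concrete, here is a rough sketch of what I have in mind. It 
is not the actual BytesRefWritable code: the class name ReusableBytesRef, the 
field names, and the wire format (a 4-byte length followed by the payload) are 
all assumptions for illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical stand-in for BytesRefWritable; not the real Hive class.
class ReusableBytesRef {
    private static final byte[] EMPTY = new byte[0];

    private byte[] bytes = EMPTY; // backing array, may be longer than the record
    private int start = 0;        // offset of the current record in bytes
    private int length = 0;       // length of the current record

    // Points this object at a caller-owned array without copying.
    // WARNING: a later readFields() on this object may overwrite that array,
    // so callers of set() should not call readFields().
    void set(byte[] newData, int offset, int len) {
        bytes = newData;
        start = offset;
        length = len;
    }

    // Assumed wire format: 4-byte length, then the payload bytes.
    void readFields(DataInput in) throws IOException {
        int len = in.readInt();
        if (bytes.length < len) {
            bytes = new byte[len]; // allocate only when the record does not fit
        }
        start = 0;
        length = len;
        in.readFully(bytes, 0, len);
    }

    // Writes the current record in the same assumed format.
    void write(DataOutput out) throws IOException {
        out.writeInt(length);
        out.write(bytes, start, length);
    }

    byte[] getData() { return bytes; }
    int getStart() { return start; }
    int getLength() { return length; }
}
```

With this shape, a reader that calls readFields once per row only allocates 
when it sees a record larger than any earlier one. Note that getData() can 
return an array longer than the record, so callers have to respect getStart() 
and getLength().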

I hope this will improve the SequenceFile read performance a lot, and give 
RCFile and SequenceFile a fair comparison.


Also, you might want to modify the write code to use the same logic - reuse the 
bytes array if possible. Then the writes will be much faster as well.


> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 
> progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, 
> hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, 
> hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch, 
> HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
>
> Column based storage has been proven a better storage layout for OLAP. 
> Hive does a great job on raw row oriented storage. In this issue, we will 
> enhance Hive to support column based storage. 
> Actually we have done some work on column based storage on top of HDFS; I 
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?

