[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700556#action_12700556 ]

He Yongqiang commented on HIVE-352:
-----------------------------------

Agreed.
Can we have both?
Option 1 is definitely better for high-selectivity filter clauses. With option 2, we can skip 
loading unnecessary (compressed) columns into memory.
I have done a simple RCFile performance test on my local single machine. It seems 
RCFile performs much better at reading than a block-compressed SequenceFile. I 
think the performance improvement should be attributed to the skip strategy.
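To make the skip idea concrete, here is a rough illustrative sketch (not code from the patch; the class, method, and codec names are made up for illustration) of reading one row group while decompressing only the projected columns and seeking past the rest:
{code}
import java.io.IOException;
import java.util.Set;
import org.apache.hadoop.fs.FSDataInputStream;

// Illustrative only: per row group, jump directly to the projected columns'
// compressed chunks and skip the others without reading or decompressing them.
class ColumnSkipSketch {
  interface ColumnCodec { byte[] decompress(byte[] compressed); }

  static byte[][] readRowGroup(FSDataInputStream in, long groupStart,
                               long[] compressedLengths, Set<Integer> projected,
                               ColumnCodec codec) throws IOException {
    byte[][] columns = new byte[compressedLengths.length][];
    long offset = groupStart;
    for (int col = 0; col < compressedLengths.length; col++) {
      if (projected.contains(col)) {
        in.seek(offset);                       // jump to this column's chunk
        byte[] buf = new byte[(int) compressedLengths[col]];
        in.readFully(buf);
        columns[col] = codec.decompress(buf);  // only projected columns pay decompression
      }
      offset += compressedLengths[col];        // unprojected chunks are never touched
    }
    return columns;
  }
}
{code}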
Below are some coarse results comparing RCFile with SequenceFile (run locally):
{noformat}
Format        Columns  Rows    Write (ms)  Read (ms)            On-disk size (bytes)
RCFile        10       100000        9851    448 (one column)              50527070
SequenceFile  10       100000       18405   9418 (all columns)             52684063
RCFile        25       100000       15112    467 (one column)             126262141
SequenceFile  25       100000       45586  22013 (all columns)            131355387
{noformat}
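For reference, a minimal sketch of how the SequenceFile side of such a timing test could look, using the standard Hadoop SequenceFile API with block compression (the output path, row/column counts, and random-string generation are stand-ins, not the exact harness that produced the numbers above):
{code}
import java.util.Random;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Illustrative benchmark sketch: write rows of random string columns into a
// block-compressed SequenceFile, then time a full scan (a row store must read
// and decompress every column even when only one is needed).
public class SeqFileBench {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("/tmp/seqfile-bench");   // hypothetical output path
    int rows = 100000, cols = 10;
    Random rand = new Random();

    long start = System.currentTimeMillis();
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, LongWritable.class, Text.class,
        SequenceFile.CompressionType.BLOCK);
    StringBuilder sb = new StringBuilder();
    for (int r = 0; r < rows; r++) {
      sb.setLength(0);
      for (int c = 0; c < cols; c++) {
        sb.append(Long.toHexString(rand.nextLong())).append('\001'); // ^A separator
      }
      writer.append(new LongWritable(r), new Text(sb.toString()));
    }
    writer.close();
    System.out.println("write ms: " + (System.currentTimeMillis() - start));

    start = System.currentTimeMillis();
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    LongWritable key = new LongWritable();
    Text value = new Text();
    while (reader.next(key, value)) {
      // even a single-column query must read and deserialize the whole row here
    }
    reader.close();
    System.out.println("read ms: " + (System.currentTimeMillis() - start));
  }
}
{code}
The RCFile writer/reader side comes from the new code in this patch, so it is not sketched here.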

I will post more detailed test results together with the next patch.

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, 
> hive-352-2009-4-17.patch, HIve-352-draft-2009-03-28.patch, 
> Hive-352-draft-2009-03-30.patch
>
>
> Column-based storage has been proven to be a better storage layout for OLAP. 
> Hive currently does a great job on raw row-oriented storage. In this issue, we 
> will enhance Hive to support column-based storage. 
> Actually, we have already done some work on column-based storage on top of HDFS; 
> I think it will need some review and refactoring to port it to Hive. 
> Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
