[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12700556#action_12700556 ]
He Yongqiang commented on HIVE-352:
-----------------------------------

Agreed. Can we have both? Option 1 is clearly better for highly selective filter clauses, and with option 2 we can skip loading unneeded (compressed) columns into memory.

I ran a simple RCFile performance test on my local single machine. RCFile reads much faster than a block-compressed SequenceFile; I think the improvement is attributable to the column-skip strategy. Below are rough results comparing RCFile with SequenceFile (local):

{noformat}
Write RCFile with 10 random string columns and 100000 rows cost 9851 milliseconds. On-disk size: 50527070 bytes.
Read only one column of a RCFile with 10 random string columns and 100000 rows cost 448 milliseconds.
Write SequenceFile with 10 random string columns and 100000 rows cost 18405 milliseconds. On-disk size: 52684063 bytes.
Read SequenceFile with 10 random string columns and 100000 rows cost 9418 milliseconds.

Write RCFile with 25 random string columns and 100000 rows cost 15112 milliseconds. On-disk size: 126262141 bytes.
Read only one column of a RCFile with 25 random string columns and 100000 rows cost 467 milliseconds.
Write SequenceFile with 25 random string columns and 100000 rows cost 45586 milliseconds. On-disk size: 131355387 bytes.
Read SequenceFile with 25 random string columns and 100000 rows cost 22013 milliseconds.
{noformat}

I will post more detailed test results together with the next patch.
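To illustrate the skip strategy the numbers above point at, here is a toy, self-contained sketch (not the actual RCFile code in the patch; the class and method names are made up for illustration). Each row group stores a per-column byte length before the column's values, so a reader that wants one column can seek past the serialized bytes of every other column without deserializing them:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Toy columnar row group: [len(col0)][col0 bytes][len(col1)][col1 bytes]...
// Reading one column = skip the byte ranges of all preceding columns.
public class ColumnSkipDemo {

    // Serialize rows column-by-column; prefix each column with its byte length.
    static byte[] writeRowGroup(String[][] rows, int numCols) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(out);
        for (int c = 0; c < numCols; c++) {
            ByteArrayOutputStream colBuf = new ByteArrayOutputStream();
            DataOutputStream colDos = new DataOutputStream(colBuf);
            for (String[] row : rows) {
                colDos.writeUTF(row[c]);
            }
            byte[] colBytes = colBuf.toByteArray();
            dos.writeInt(colBytes.length);   // column length header
            dos.write(colBytes);             // column values
        }
        return out.toByteArray();
    }

    // Read only the wanted column; unwanted columns are skipped, never decoded.
    static List<String> readColumn(byte[] data, int numRows, int wanted)
            throws IOException {
        DataInputStream dis = new DataInputStream(new ByteArrayInputStream(data));
        for (int c = 0; c < wanted; c++) {
            int len = dis.readInt();
            dis.skipBytes(len);              // skip a whole column's bytes
        }
        dis.readInt();                       // consume the wanted column's header
        List<String> values = new ArrayList<>();
        for (int r = 0; r < numRows; r++) {
            values.add(dis.readUTF());
        }
        return values;
    }
}
```

In a row-oriented layout (like SequenceFile records) every column of every row must be scanned to extract one column; here the cost of the unwanted columns collapses to a length read plus a skip, which is why the one-column RCFile reads above are an order of magnitude faster than the full SequenceFile scans.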
> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
>
> Column-based storage has been proven a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I think it will need some review and refactoring to port it to Hive.
> Any thoughts?

-- 
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.