[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701876#action_12701876 ]
Zheng Shao edited comment on HIVE-352 at 4/23/09 3:14 AM:
----------------------------------------------------------

Running Yongqiang's tests with the Hadoop native library, using DefaultCodec for both RCFile and SequenceFile. The file is on the local file system.

RCFile's read performance appears to be around twice that of SequenceFile (2227 ms versus 4343 ms to read all columns), probably because we do bulk decompression and one less copy of the data. This result looks reasonable.

{code}
Write RCFile with 80 random string columns and 100000 rows cost 25464 milliseconds. And the file's on disk size is 91874941
Write SequenceFile with 80 random string columns and 100000 rows cost 35711 milliseconds. And the file's on disk size is 102521005
Read only one column of a RCFile with 80 random string columns and 100000 rows cost 594 milliseconds.
Read only first and last columns of a RCFile with 80 random string columns and 100000 rows cost 600 milliseconds.
Read all columns of a RCFile with 80 random string columns and 100000 rows cost 2227 milliseconds.
Read SequenceFile with 80 random string columns and 100000 rows cost 4343 milliseconds.
{code}

This is the result using GzipCodec. Not much difference.

{code}
Write RCFile with 80 random string columns and 100000 rows cost 26358 milliseconds. And the file's on disk size is 91931563
Write SequenceFile with 80 random string columns and 100000 rows cost 35802 milliseconds. And the file's on disk size is 102528154
Read only one column of a RCFile with 80 random string columns and 100000 rows cost 593 milliseconds.
Read only first and last columns of a RCFile with 80 random string columns and 100000 rows cost 626 milliseconds.
Read all columns of a RCFile with 80 random string columns and 100000 rows cost 2401 milliseconds.
Read SequenceFile with 80 random string columns and 100000 rows cost 4601 milliseconds.
{code}

Each column is a random string whose length is drawn uniformly from 0 to 30, consisting of random uppercase and lowercase letters.
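For anyone who wants to reproduce data of the shape described in this comment (80 columns per row, each a random string of uppercase and lowercase letters with a length between 0 and 30), here is a minimal sketch. It is not Yongqiang's actual test code: the class and method names are hypothetical, the fixed seed is only there for repeatability, and treating the upper bound of 30 as inclusive is an assumption.

```java
import java.util.Random;

// Hypothetical helper, not part of the HIVE-352 patch.
public class RandomRowGenerator {
    // Alphabet assumed from the comment: uppercase and lowercase letters only.
    private static final String LETTERS =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Assumption: "0 to 30" is inclusive on both ends.
    static final int MAX_LENGTH = 30;

    // Returns one random column value with a length uniform in [0, MAX_LENGTH].
    static String randomString(Random rng) {
        int length = rng.nextInt(MAX_LENGTH + 1);
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(LETTERS.charAt(rng.nextInt(LETTERS.length())));
        }
        return sb.toString();
    }

    // Returns one row made of the requested number of random string columns.
    static String[] randomRow(Random rng, int columns) {
        String[] row = new String[columns];
        for (int c = 0; c < columns; c++) {
            row[c] = randomString(rng);
        }
        return row;
    }

    public static void main(String[] args) {
        Random rng = new Random(42L); // fixed seed so runs are repeatable
        String[] row = randomRow(rng, 80);
        System.out.println(row.length + " columns, first = \"" + row[0] + "\"");
    }
}
```

Calling randomRow 100000 times with 80 columns would give a data set of the same dimensions as the one in the timings above; the rows would still need to be written through the RCFile and SequenceFile writers to compare the two formats.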
> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch, hive-352-2009-4-23.patch, HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
>
> Column based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row oriented storage. In this issue, we will enhance Hive to support column based storage.
> Actually, we have done some work on column based storage on top of HDFS; I think it will need some review and refactoring to port it to Hive.
> Any thoughts?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.