[jira] Issue Comment Edited: (HIVE-352) Make Hive support column based storage

Zheng Shao (JIRA) Thu, 23 Apr 2009 03:15:14 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701876#action_12701876
 ]


Zheng Shao edited comment on HIVE-352 at 4/23/09 3:14 AM:
----------------------------------------------------------

I ran Yongqiang's tests with the Hadoop native library, using DefaultCodec for 
both RCFile and SequenceFile. The files are on the local file system.

RCFile's read performance appears to be about twice that of SequenceFile when 
reading all columns, probably because we do bulk decompression and make one 
fewer copy of the data. This result looks reasonable.
{code}
Write RCFile with 80 random string columns and 100000 rows cost 25464 milliseconds. And the file's on disk size is 91874941
Write SequenceFile with 80 random string columns and 100000 rows cost 35711 milliseconds. And the file's on disk size is 102521005
Read only one column of a RCFile with 80 random string columns and 100000 rows cost 594 milliseconds.
Read only first and last columns of a RCFile with 80 random string columns and 100000 rows cost 600 milliseconds.
Read all columns of a RCFile with 80 random string columns and 100000 rows cost 2227 milliseconds.
Read SequenceFile with 80  random string columns and 100000 rows cost 4343 milliseconds.
{code}

These are the results using GzipCodec. There is not much difference.
{code}
Write RCFile with 80 random string columns and 100000 rows cost 26358 milliseconds. And the file's on disk size is 91931563
Write SequenceFile with 80 random string columns and 100000 rows cost 35802 milliseconds. And the file's on disk size is 102528154
Read only one column of a RCFile with 80 random string columns and 100000 rows cost 593 milliseconds.
Read only first and last columns of a RCFile with 80 random string columns and 100000 rows cost 626 milliseconds.
Read all columns of a RCFile with 80 random string columns and 100000 rows cost 2401 milliseconds.
Read SequenceFile with 80  random string columns and 100000 rows cost 4601 milliseconds.
{code}
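
To make the comparison easier to read, here is a small sketch that derives 
ratios from the numbers reported above. The dictionary keys and the ratios() 
helper are names made up for this illustration; they are not part of the test 
code. Only the timings and sizes come from the output above.

```python
# Timings (ms) and on-disk sizes (bytes) copied from the benchmark output.
# The key names are hypothetical labels chosen for this sketch.
RESULTS = {
    "DefaultCodec": {
        "rcfile_write_ms": 25464, "seqfile_write_ms": 35711,
        "rcfile_bytes": 91874941, "seqfile_bytes": 102521005,
        "rcfile_read_one_ms": 594, "rcfile_read_all_ms": 2227,
        "seqfile_read_ms": 4343,
    },
    "GzipCodec": {
        "rcfile_write_ms": 26358, "seqfile_write_ms": 35802,
        "rcfile_bytes": 91931563, "seqfile_bytes": 102528154,
        "rcfile_read_one_ms": 593, "rcfile_read_all_ms": 2401,
        "seqfile_read_ms": 4601,
    },
}


def ratios(r):
    """Return SequenceFile-vs-RCFile ratios for one codec's results."""
    return {
        # > 1 means RCFile is faster than SequenceFile.
        "read_all_speedup": r["seqfile_read_ms"] / r["rcfile_read_all_ms"],
        "read_one_speedup": r["seqfile_read_ms"] / r["rcfile_read_one_ms"],
        "write_speedup": r["seqfile_write_ms"] / r["rcfile_write_ms"],
        # < 1 means the RCFile is smaller on disk.
        "size_ratio": r["rcfile_bytes"] / r["seqfile_bytes"],
    }


for codec, r in RESULTS.items():
    print(codec, {name: round(value, 2) for name, value in ratios(r).items()})
```

With these inputs, reading all columns comes out a little under 2x faster for 
RCFile with both codecs, and reading a single column is more than 7x faster 
than scanning the whole SequenceFile.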

Each column is a random string whose length is uniformly distributed from 0 to 
30, consisting of random uppercase and lowercase letters.
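
For anyone who wants to reproduce similar input, here is a minimal sketch of 
such a data generator. It is not the code from the patch: random_column() and 
random_row() are hypothetical helpers, and it assumes the length range 0 to 30 
is inclusive at both ends and that "letters" means ASCII A-Z and a-z.

```python
import random
import string


def random_column(rng, max_len=30):
    """One random string column value.

    Assumption: the length is uniform over 0..max_len inclusive.
    """
    length = rng.randint(0, max_len)  # randint includes both endpoints
    return "".join(rng.choice(string.ascii_letters) for _ in range(length))


def random_row(rng, num_columns=80):
    """One row of num_columns random string columns."""
    return [random_column(rng) for _ in range(num_columns)]


# A fixed seed keeps the generated data repeatable between runs.
rng = random.Random(0)

# The test described above used 100000 rows; 1000 keeps this sketch quick.
rows = [random_row(rng) for _ in range(1000)]
print(len(rows), "rows of", len(rows[0]), "columns")
```

Under the inclusive-range assumption the average column is 15 characters long, 
so 80 columns by 100000 rows is roughly 120 million characters of raw string 
data before any per-record overhead or compression.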


  
> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 
> progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, 
> hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, 
> hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch, 
> hive-352-2009-4-23.patch, HIve-352-draft-2009-03-28.patch, 
> Hive-352-draft-2009-03-30.patch
>
>
> Column based storage has been proven to be a better storage layout for OLAP. 
> Hive does a great job on raw row oriented storage. In this issue, we will 
> enhance Hive to support column based storage. 
> Actually, we have done some work on column based storage on top of HDFS; I 
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
