[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
He Yongqiang updated HIVE-352:
------------------------------

    Attachment: 4-22 performace2.txt
                hive-352-2009-4-22-2.patch

Following Zheng's suggestions, hive-352-2009-4-22-2.patch makes several improvements over hive-352-2009-4-22.patch:
1) Each row of test data is now randomly generated as string bytes (the previous patch produced binary bytes).
2) A correctness parameter was added to the performance test (in PerformTestRCFileAndSeqFile) so we can verify that what we read is what we wrote.

4-22 performace2.txt adds more detailed test results for three configurations:

1. Local file system, using bulk decompression in RCFile->ValueBuffer->readFields(), like:
{noformat}
bufferRef.write(valueIn, columnBlockPlainDataLength);
{noformat}
2. Local file system, not using bulk decompression in RCFile->ValueBuffer->readFields(), like:
{noformat}
while (deflateFilter.available() > 0)
  bufferRef.write(valueIn, 1);
{noformat}
3. DistributedFileSystem with bulk decompression; the tests were still run on my local machine.

Here are the brief results (for more detail, please see the attached 4-22 performace2.txt):

1. (LocalFileSystem) Bulk decompression in RCFile->ValueBuffer->readFields(), with noise added between the two RCFile reads and after the write to avoid disk-cache effects.

||column number||RCFile size||RCFile read 1 column||RCFile read 2 columns||RCFile read all columns||SequenceFile size||SequenceFile read all||
|10|11501112|259|181|498|13046020|7002|
|25|28725817|233|269|1082|32246409|16539|
|40|45940679|261|301|1698|51436799|25415|

2. (LocalFileSystem) No bulk decompression in RCFile->ValueBuffer->readFields(), with noise added between the two RCFile reads and after the write to avoid disk-cache effects.

||column number||RCFile size||RCFile read 1 column||RCFile read 2 columns||RCFile read all columns||SequenceFile size||SequenceFile read all||
|10|11501112|1804|3262|15956|13046020|6927|
|25|28725817|1761|3310|39492|32246409|15983|
|40|45940679|1843|3386|63759|51436799|25256|
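The only functional difference between configurations 1 and 2 is whether decompressed bytes are consumed in one bulk call or one byte per call. The following standalone sketch is my own illustration of that contrast using plain java.util.zip streams, not Hive's RCFile or Hadoop's codec classes; the class name, buffer sizes, and the readFully helper are invented for the example:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class BulkVsByteRead {

    // Read exactly 'total' decompressed bytes, at most 'chunk' bytes per read() call.
    static byte[] readFully(InputStream in, int total, int chunk) throws IOException {
        byte[] out = new byte[total];
        int off = 0;
        while (off < total) {
            int n = in.read(out, off, Math.min(chunk, total - off));
            if (n < 0) break;
            off += n;
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // String-like random bytes, as in the patched performance test.
        byte[] plain = new byte[1 << 20];
        Random r = new Random(42);
        for (int i = 0; i < plain.length; i++) {
            plain[i] = (byte) ('a' + r.nextInt(26));
        }
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DeflaterOutputStream dos = new DeflaterOutputStream(bos);
        dos.write(plain);
        dos.close();
        byte[] compressed = bos.toByteArray();

        // Bulk: one large read per call, analogous to
        // bufferRef.write(valueIn, columnBlockPlainDataLength).
        long t0 = System.nanoTime();
        byte[] bulk = readFully(
            new InflaterInputStream(new ByteArrayInputStream(compressed)),
            plain.length, 64 * 1024);
        long bulkNs = System.nanoTime() - t0;

        // Byte-at-a-time: analogous to the loop writing 1 byte per call.
        long t1 = System.nanoTime();
        byte[] single = readFully(
            new InflaterInputStream(new ByteArrayInputStream(compressed)),
            plain.length, 1);
        long singleNs = System.nanoTime() - t1;

        System.out.println("outputs match: " + Arrays.equals(bulk, single)
            + ", bulk ns: " + bulkNs + ", per-byte ns: " + singleNs);
    }
}
```

Both paths decompress the same bytes; the per-byte path just pays the per-call overhead once per byte, which is consistent with the large read-all-columns gap between tables 1 and 2.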
3. (DistributedFileSystem) Bulk decompression in RCFile->ValueBuffer->readFields(), with noise added between the two RCFile reads and after the write to avoid disk-cache effects.

||column number||RCFile size||RCFile read 1 column||RCFile read 2 columns||RCFile read all columns||SequenceFile size||SequenceFile read all||
|10|11501112|2381|3516|9898|13046020|18053|
|25|28725817|3754|5254|22521|32246409|43258|
|40|45940679|5597|8225|40304|51436799|69278|

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22
> progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch,
> hive-352-2009-4-17.patch, hive-352-2009-4-19.patch,
> hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch,
> HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive currently does a great job on raw row-oriented storage. In this issue, we will
> enhance Hive to support column-based storage.
> Actually, we have already done some work on column-based storage on top of HDFS; I
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.