[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12701837#action_12701837 ]
He Yongqiang commented on HIVE-352:
-----------------------------------

Thanks, Zheng.

>> 0. Did you try that with hadoop 0.17.0? "ant -Dhadoop.version=0.17.0 test" etc.

Yes.

>> 1. Can you add your tests to ant, or post the testing scripts so that everybody can easily reproduce the test results that you have got?

I will do that with the next patch.

>> 2. For DistributedFileSystem, how big is the cluster? Is the file (the file size is small, so it's clearly a single block) local?

The cluster has six nodes. The file is not local: the test was run on my local machine and used HDFS.

>> 3. It seems SequenceFile's compression is not as good as RCFile's, although the data is the same and also random. What is the exact record format in SequenceFile? Did you put delimiters, or did you put the lengths of the Strings?

Yes, it stores the lengths of the Strings. However, RCFile also stores the string lengths.

>> The approach of storing compressed data at creation and doing bulk decompression at reading is not practical, because it is very easy to run out of memory.

Yes, I encountered an OutOfMemory error, so I added a check in RCFile.Writer's append:

{noformat}
if ((columnBufferSize + (this.bufferedRecords * this.columnNumber * 2) > COLUMNS_BUFFER_SIZE)
    || (this.bufferedRecords >= this.RECORD_INTERVAL)) {
  flushRecords();
}
{noformat}

>> We've done BULK, and it showed great performance (1.6s to read and decompress a 40MB local file), but I suspect the compression ratio will be lower than NONBULK. Can you compare the compression ratio of BULK and NONBULK, given different buffer sizes and column numbers?

BULK and NONBULK refer to decompression modes, so they apply only to reads. They have nothing to do with writes, so I expect they will not affect the compression ratio.
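To make the flush condition above concrete, here is a minimal, self-contained sketch of the buffered-append logic. The class and field names (ColumnBufferSketch, flushCount) and the threshold values are assumptions chosen for illustration, not the actual RCFile.Writer implementation; only the shape of the flush condition follows the quoted snippet.

```java
// Hypothetical sketch of RCFile.Writer-style column buffering.
// Thresholds are illustrative assumptions, not Hive's real defaults.
public class ColumnBufferSketch {
    static final int COLUMNS_BUFFER_SIZE = 4 * 1024 * 1024; // byte budget for buffered columns
    static final int RECORD_INTERVAL = 1000;                // max records per row group

    final int columnNumber;  // columns per record
    int columnBufferSize;    // bytes currently buffered across all columns
    int bufferedRecords;     // records appended since the last flush
    int flushCount;          // row groups written so far

    public ColumnBufferSketch(int columnNumber) {
        this.columnNumber = columnNumber;
    }

    // Append one record whose serialized column values total recordBytes.
    public void append(int recordBytes) {
        columnBufferSize += recordBytes;
        bufferedRecords++;
        // Flush when either the byte budget (including an approximated
        // 2-byte-per-value length overhead) or the record-count budget is
        // exhausted. Bounding the buffer this way is what avoids the
        // OutOfMemory error mentioned above: data is compressed and
        // written out one row group at a time.
        if ((columnBufferSize + bufferedRecords * columnNumber * 2 > COLUMNS_BUFFER_SIZE)
                || (bufferedRecords >= RECORD_INTERVAL)) {
            flushRecords();
        }
    }

    // Stand-in for compressing and writing the buffered row group.
    void flushRecords() {
        flushCount++;
        columnBufferSize = 0;
        bufferedRecords = 0;
    }
}
```

With 8 columns and 100-byte records, the byte budget is far from exhausted after 1000 records, so the record-interval condition is the one that triggers the first flush.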
> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch, HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
> Column-based storage has been proven a better storage layout for OLAP. Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I think it will need some review and refactoring to port it to Hive.
> Any thoughts?

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.