[
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702267#action_12702267
]
Zheng Shao commented on HIVE-352:
---------------------------------
@Yongqiang,
The native codec probably matters more for SequenceFile because SequenceFile
uses compression differently from RCFile - for example, incremental
compression/decompression.
I ran a test on our data set, which mainly contains around 40 string columns;
the string length is usually fixed within a column, ranging from 1 to 10. The
result is that SequenceFile is much smaller than RCFile - only around 55% of
the RCFile size. However, inside the RCFile I see a lot of repeated bytes -
the length of the field for each row. RCFile is also slower, probably because
it writes out more data than SequenceFile.
1. Can you also compress the field length columns? I tried compressing the
RCFile again with the gzip command line, and it became 41% of its current size -
a lot smaller than the SequenceFile. This means that, in general, RCFile can
save a lot of space, because it is easier for the compression algorithm to
compress the lengths and the contents of each column separately.
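The effect of leaving the length column uncompressed can be sketched as follows. This is an illustrative example in Python, not RCFile's actual on-disk format: it just models a column of 10,000 values whose length is fixed at 8 (as in the data set above), with one 4-byte length written per row.

```python
import struct
import zlib

num_rows = 10_000

# Per-row field lengths written uncompressed, 4 bytes each: 40 KB of
# highly repetitive bytes - exactly the repeated bytes visible inside
# the RCFile when the length column is not compressed.
raw_lengths = b"".join(struct.pack(">i", 8) for _ in range(num_rows))

# Compressing the length column on its own shrinks it to almost nothing,
# because a fixed per-column length is trivially compressible.
packed_lengths = zlib.compress(raw_lengths)

print(len(raw_lengths), len(packed_lengths))
```

Since the contents column is compressed separately anyway, packing the length column too is nearly free space savings.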
2. Also, I remember you changed the compression to be incremental, so the
current solution is a mix of BULK and NONBULK as I described above, which has
memory problems. Since, as we discussed, we would like to leave the NONBULK
mode for later because of the amount of additional work, can you change the
code back to BULK compression? There is probably a performance loss due to
incremental compression, which can be avoided by bulk compression.
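The cost of incremental compression can be sketched like this. Again an illustrative Python sketch rather than Hive's implementation: BULK is modeled as compressing the whole column buffer in one shot, and incremental as a streaming compressor that flushes after every value (so each value can be decompressed as it is read); each per-value flush adds block overhead to the output.

```python
import zlib

# Hypothetical column data: 5,000 short string values with many repeats.
values = [("value-%d" % (i % 100)).encode() for i in range(5_000)]

# BULK mode: buffer the whole column, compress once.
bulk = zlib.compress(b"".join(values))

# Incremental mode: feed each value through a streaming compressor and
# sync-flush per value; every flush emits extra block-boundary bytes.
co = zlib.compressobj()
incremental = b"".join(
    co.compress(v) + co.flush(zlib.Z_SYNC_FLUSH) for v in values
)
incremental += co.flush()

print(len(bulk), len(incremental))
```

With data like this, the per-value flush overhead alone dwarfs the bulk-compressed output, which is why bulk compression avoids both the size and speed penalty.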
> Make Hive support column based storage
> --------------------------------------
>
> Key: HIVE-352
> URL: https://issues.apache.org/jira/browse/HIVE-352
> Project: Hadoop Hive
> Issue Type: New Feature
> Reporter: He Yongqiang
> Assignee: He Yongqiang
> Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22
> progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch,
> hive-352-2009-4-17.patch, hive-352-2009-4-19.patch,
> hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch,
> hive-352-2009-4-23.patch, HIve-352-draft-2009-03-28.patch,
> Hive-352-draft-2009-03-30.patch
>
>
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will
> enhance Hive to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.