[
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12702267#action_12702267
]
Zheng Shao commented on HIVE-352:
---------------------------------
@Yongqiang,
The native codec probably matters more for SequenceFile because SequenceFile
uses compression differently from RCFile - for example, incremental
compression/decompression.
I ran a test on our data set, which mainly contains around 40 string columns;
the string length is usually fixed within a column, ranging from 1 to 10. The
result is that SequenceFile is much smaller than RCFile - only around 55% of
the RCFile size. However, inside the RCFile I see a lot of repeated bytes -
the length of the field for each row. RCFile is also slower, probably because
it writes out more data than SequenceFile.
1. Can you also compress the field length columns? I tried compressing the
RCFile again with the gzip command line, and it became 41% of its current size -
a lot smaller than the SequenceFile. This means that, in general, RCFile can
save a lot of space, because it is easier for the compression algorithm to
compress the lengths and the contents of each column separately.
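The effect of leaving the length column uncompressed can be sketched as follows. This is an illustrative example in Python, not RCFile's actual on-disk format: it just models a column of 10,000 values whose length is fixed at 8 (as in the data set above), with one 4-byte length written per row.

```python
import struct
import zlib

num_rows = 10_000

# Per-row field lengths written uncompressed, 4 bytes each: 40 KB of
# highly repetitive bytes - exactly the repeated bytes visible inside
# the RCFile when the length column is not compressed.
raw_lengths = b"".join(struct.pack(">i", 8) for _ in range(num_rows))

# Compressing the length column on its own shrinks it to almost nothing,
# because a fixed per-column length is trivially compressible.
packed_lengths = zlib.compress(raw_lengths)

print(len(raw_lengths), len(packed_lengths))
```

Since the contents column is compressed separately anyway, packing the length column too is nearly free space savings.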
2. Also, I remember you changed the compression to be incremental, so the
current solution is a mix of BULK and NONBULK as I described above, which has
memory problems. Since, as we discussed, we would like to leave the NONBULK
mode for later because of the amount of additional work, can you change the
code back to BULK compression? There is probably a performance loss due to
incremental compression, which can be avoided by bulk compression.
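The cost of incremental compression can be sketched like this. Again an illustrative Python sketch rather than Hive's implementation: BULK is modeled as compressing the whole column buffer in one shot, and incremental as a streaming compressor that flushes after every value (so each value can be decompressed as it is read); each per-value flush adds block overhead to the output.

```python
import zlib

# Hypothetical column data: 5,000 short string values with many repeats.
values = [("value-%d" % (i % 100)).encode() for i in range(5_000)]

# BULK mode: buffer the whole column, compress once.
bulk = zlib.compress(b"".join(values))

# Incremental mode: feed each value through a streaming compressor and
# sync-flush per value; every flush emits extra block-boundary bytes.
co = zlib.compressobj()
incremental = b"".join(
    co.compress(v) + co.flush(zlib.Z_SYNC_FLUSH) for v in values
)
incremental += co.flush()

print(len(bulk), len(incremental))
```

With data like this, the per-value flush overhead alone dwarfs the bulk-compressed output, which is why bulk compression avoids both the size and speed penalty.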
> Make Hive support column based storage
> --------------------------------------
>
> Key: HIVE-352
> URL: https://issues.apache.org/jira/browse/HIVE-352
> Project: Hadoop Hive
> Issue Type: New Feature
> Reporter: He Yongqiang
> Assignee: He Yongqiang
> Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22
> progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch,
> hive-352-2009-4-17.patch, hive-352-2009-4-19.patch,
> hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch,
> hive-352-2009-4-23.patch, HIve-352-draft-2009-03-28.patch,
> Hive-352-draft-2009-03-30.patch
>
>
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will
> enhance Hive to support column-based storage.
> Actually, we have done some work on column-based storage on top of HDFS; I
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.