[ 
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683293#action_12683293
 ] 

Zheng Shao commented on HIVE-352:
---------------------------------

Hi Yongqiang,

Sorry for jumping on this issue late.


Let me summarize the choices that we have to make:

A. Put different columns in different files (we can still have column sets: a 
bunch of columns stored in the same file)

B. Put different columns in the same file, but organize the file in a 
block-based way. Within a single block, the first column of all rows comes 
first, then the second column, and so on.
  B1: Write a new FileFormat
  B2: Continue to use SequenceFileFormat
    B2.1: Store a block in multiple records, one record for each column. Use 
    the key to label the beginning of a block (or column id).
    B2.2: Store a block in a single record
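
To make the block layout in option B concrete, here is a minimal sketch in plain Python. It is only an illustration: the function names are hypothetical, rows are modeled as Python lists, and nothing here is Hive or Hadoop code.

```python
def to_column_block(rows):
    """Lay out one block of rows column by column (option B).

    The result holds all values of column 0 first, then column 1, etc.
    """
    if not rows:
        return []
    width = len(rows[0])
    # Every row in a block is assumed to have the same number of columns.
    assert all(len(row) == width for row in rows)
    return [[row[c] for row in rows] for c in range(width)]


def from_column_block(columns):
    """Rebuild the original rows from a column-major block."""
    return [list(row) for row in zip(*columns)]


rows = [[1, "a", 1.5], [2, "b", 2.5], [3, "c", 3.5]]
block = to_column_block(rows)
restored = from_column_block(block)
```

Here `block` holds the three values of the first column, then the second, then the third, and `restored` is equal to `rows` again.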


Comparing A and B: 1. B is much easier to implement than A. Hadoop jobs take 
files as input. If the data is stored in a single file, it's much easier to 
read from or write to that file. 2. B may have the advantage of locality. 3. B 
may require a little more buffer memory for writing. 4. B may not be as 
efficient as A for reading, since all data needs to be read (unless the 
FileFormat supports "skip", but that might create more random seeks, depending 
on block size).

Comparing B1 and B2: 1. B1 is much more flexible since we can do whatever we 
want (especially skip-reading etc); 2. B2 is much easier to do and we naturally 
enjoy all benefits of SequenceFile: splittable, customizable compression codec.

Comparing B2.1 and B2.2: 1. B2.2 is easier to implement, because we don't have 
the problem of different columns of the same block being split across multiple 
mappers. 2. B2.1 is potentially more efficient if we allow SequenceFile to 
skip records and have Hive tell us which columns can be skipped.
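
The difference between B2.1 and B2.2 can be sketched as follows. This is a rough model only: records are plain Python (key, value) tuples rather than real SequenceFile records, and the names `encode_b22`, `encode_b21` and `read_b21` are hypothetical.

```python
def encode_b22(block_id, columns):
    """B2.2: the whole block is stored in a single record."""
    return [(block_id, columns)]


def encode_b21(block_id, columns):
    """B2.1: one record per column; the key labels block id and column id."""
    return [((block_id, col_id), values)
            for col_id, values in enumerate(columns)]


def read_b21(records, wanted_columns):
    """Collect only the wanted columns from B2.1 records.

    A real reader would skip the value bytes of unwanted records;
    this model simply ignores them.
    """
    result = {}
    for (block_id, col_id), values in records:
        if col_id not in wanted_columns:
            continue
        result.setdefault(block_id, {})[col_id] = values
    return result


columns = [[1, 2, 3], ["a", "b", "c"], [1.5, 2.5, 3.5]]
single = encode_b22(0, columns)
per_column = encode_b21(0, columns)
projected = read_b21(per_column, wanted_columns={0, 2})
```

With B2.2 a block always travels as one record, so it cannot be split across mappers. With B2.1 the three records of block 0 are separate, which is what makes column skipping possible and also what creates the splitting problem.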

As a result, I would suggest trying B2.2 as the first exercise, then B2.1, 
then B1, then A.

The amount of work for each level (B2.2, B2.1, B1, A) will probably differ by a 
factor of 3-5. So starting from B2.2 does not cost much, and the first steps 
will also be good learning steps for the later ones.
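
As a rough back-of-the-envelope check, assume each level costs the same factor f more than the previous one (one possible reading of "a factor of 3-5"), with B2.2 as one unit of work. The function name below is hypothetical.

```python
def relative_overhead(factor):
    """Work spent on B2.2, B2.1 and B1 relative to the work for A alone.

    Assumed units: B2.2 = 1, B2.1 = f, B1 = f^2, A = f^3.
    """
    earlier_levels = 1 + factor + factor ** 2
    final_level = factor ** 3
    return earlier_levels / final_level


for f in (3, 5):
    print(f, round(relative_overhead(f), 2))
```

Under that assumption, the three earlier levels together cost about 0.48 of the work for A when f = 3, and about 0.25 when f = 5.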

Thoughts?


> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>
> Column-based storage has been proven to be a better storage layout for OLAP. 
> Hive does a great job on raw row-oriented storage. In this issue, we will 
> enhance Hive to support column-based storage. 
> Actually, we have done some work on column-based storage on top of HDFS; I 
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?
