[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682716#action_12682716 ]
Joydeep Sen Sarma commented on HIVE-352:
----------------------------------------
thanks for taking this on. this could be pretty awesome.
traditionally the arguments for columnar storage have been reduced 'scan
bandwidth' and better compression. In practice - we see that scan bandwidth has
two components:
1. disk/file-system bandwidth to read data
2. compute cost to scan data
most columnar stores optimize for both (especially because in shared-disk
architectures - #1 is at a premium). However - our limited experience is that
in Hadoop #1 is almost infinite. #2 can still be a bottleneck though. (it
is possible that this observation holds because of high hadoop/java compute
overheads - regardless - this seems to be the reality).
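for concreteness - a tiny micro-benchmark (purely illustrative, not hive code)
that separates the two components on a tab-delimited text file. on a second run
the OS cache makes #1 look nearly free, while the parsing cost in #2 stays put:

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    // hypothetical micro-benchmark - separates raw read bandwidth (#1)
    // from read-plus-parse cost (#2) on any tab-delimited text file
    public class ScanCostSketch {
        public static void main(String[] args) throws IOException {
            File f = new File(args[0]);
            byte[] buf = new byte[64 * 1024];

            // component #1: just pull bytes off the file system
            long t0 = System.nanoTime();
            long bytes = 0;
            try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
                int n;
                while ((n = in.read(buf)) != -1) bytes += n;
            }
            double readSec = (System.nanoTime() - t0) / 1e9;

            // component #2: read the same bytes and split every row into fields
            t0 = System.nanoTime();
            long fields = 0;
            try (BufferedReader r = new BufferedReader(new InputStreamReader(
                    new FileInputStream(f), StandardCharsets.UTF_8))) {
                String line;
                while ((line = r.readLine()) != null) fields += line.split("\t", -1).length;
            }
            double scanSec = (System.nanoTime() - t0) / 1e9;

            System.out.printf("read only : %.1f MB/s%n", bytes / 1e6 / readSec);
            System.out.printf("read+parse: %.1f MB/s (%d fields)%n",
                    bytes / 1e6 / scanSec, fields);
        }
    }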
Given this - i like the idea of a scheme where columns are stored as
independent streams inside a block-oriented file format (each file block
contains a set of rows - but the organization inside each block is by column).
This does not optimize for #1 - but does optimize for #2 (potentially in
conjunction with Hive's interfaces to get one column at a time from the IO
libraries). It also gives us nearly equivalent compression.
(The alternative scheme of having a different file per column is also
complicated by the fact that locality is almost impossible to ensure - and
there is no reasonable way of asking hdfs to colocate different file segments,
nor is one likely in the near future.)
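to make the preferred layout concrete - a minimal sketch of a writer for the
columnar-within-a-block idea (every class and constant name here is made up,
this is not hive or hadoop code): buffer a set of rows, then write each column
of the block as its own length-prefixed stream so a reader can seek past the
columns it doesn't need:

    import java.io.*;
    import java.util.*;

    // hypothetical writer: each block holds ROWS_PER_BLOCK rows, but inside
    // the block every column is one contiguous, independently skippable stream
    public class ColumnInBlockWriter implements Closeable {
        private static final int ROWS_PER_BLOCK = 10000;
        private final DataOutputStream out;
        private final int numColumns;
        private final List<List<byte[]>> buffered; // buffered.get(c) = values of column c

        public ColumnInBlockWriter(OutputStream raw, int numColumns) {
            this.out = new DataOutputStream(raw);
            this.numColumns = numColumns;
            this.buffered = new ArrayList<List<byte[]>>();
            for (int c = 0; c < numColumns; c++) buffered.add(new ArrayList<byte[]>());
        }

        public void append(byte[][] row) throws IOException {
            for (int c = 0; c < numColumns; c++) buffered.get(c).add(row[c]);
            if (buffered.get(0).size() >= ROWS_PER_BLOCK) flushBlock();
        }

        private void flushBlock() throws IOException {
            int rows = buffered.get(0).size();
            if (rows == 0) return;
            out.writeInt(rows);
            for (List<byte[]> col : buffered) {
                long len = 0;
                for (byte[] v : col) len += 4 + v.length;
                out.writeLong(len); // lets a reader skip this whole column stream
                for (byte[] v : col) { out.writeInt(v.length); out.write(v); }
                col.clear();
            }
        }

        @Override public void close() throws IOException { flushBlock(); out.close(); }
    }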
--
i would love to understand how you are planning to approach this. will we
still use sequencefile as a container - or should we ditch it? (it wasn't a
great fit for hive - given that we don't use the key field - but it was the
best thing we could find). We have seen that keeping a number of codecs open
can hurt memory usage - so one open question for me is whether we can actually
afford to open N concurrent compressed streams (assuming each column is stored
compressed separately).
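to put the codec question in concrete terms - compressing each column
separately means one live codec per column, and each open stream pins its
codec's native state plus a java-side buffer until the block is flushed (a toy
sketch with made-up numbers, using plain java.util.zip rather than hadoop's
codec classes):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    // toy sketch - N concurrently open compressed column streams
    public class PerColumnCodecSketch {
        public static void main(String[] args) throws IOException {
            int numColumns = 50; // made-up table width
            Deflater[] codecs = new Deflater[numColumns];
            DeflaterOutputStream[] streams = new DeflaterOutputStream[numColumns];
            for (int c = 0; c < numColumns; c++) {
                codecs[c] = new Deflater();
                // every open stream pins its codec's native zlib window plus
                // this 64 KB buffer for the life of the block being written
                streams[c] = new DeflaterOutputStream(
                        new ByteArrayOutputStream(), codecs[c], 64 * 1024);
            }
            // ... column values would be appended across all N streams here ...
            for (int c = 0; c < numColumns; c++) {
                streams[c].close(); // finishes the compressed stream
                codecs[c].end();    // releases the native zlib memory
            }
        }
    }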
It also seems that one could define a ColumnarInputFormat/OutputFormat as a
generic api with different implementations and pluggable containers
underneath - supporting either the file-per-column or the columnar-within-a-block
approach. in that sense we could build something more generic for hadoop (and
then just make sure that hive's lazy serde uses the columnar api for data
access - instead of the row-based api exposed by the current inputformat).
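one possible shape for such an api - purely a sketch, every name here is
hypothetical: instead of handing back whole rows the way the current
inputformat does, a reader advances over row groups and serves one projected
column at a time:

    import java.io.IOException;

    // hypothetical interface - not an existing hadoop or hive api. the
    // container underneath (file-per-column, or columnar-within-a-block)
    // would be a pluggable implementation detail.
    public interface ColumnarRecordReader {
        // restrict reads to a projection; columns not listed are never
        // decompressed or deserialized
        void setProjectedColumns(int[] columnIds);

        // advance to the next row group (e.g. the next block of the file)
        boolean nextRowGroup() throws IOException;

        // values of one projected column for the current row group
        byte[][] getColumn(int columnId) throws IOException;

        void close() throws IOException;
    }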
> Make Hive support column based storage
> --------------------------------------
>
> Key: HIVE-352
> URL: https://issues.apache.org/jira/browse/HIVE-352
> Project: Hadoop Hive
> Issue Type: New Feature
> Reporter: he yongqiang
>
> column based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will
> enhance hive to support column based storage.
> Actually, we have already done some work on column based storage on top of
> hdfs; i think it will need some review and refactoring to port it to Hive.
> Any thoughts?