[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682716#action_12682716 ]
Joydeep Sen Sarma commented on HIVE-352:
----------------------------------------
thanks for taking this on. this could be pretty awesome.
traditionally the arguments for columnar storage have been reduced 'scan
bandwidth' and better compression. In practice - we see that scan bandwidth has
two components:
1. disk/file-system bandwidth to read data
2. compute cost to scan data
most columnar stores optimize for both (especially because in shared-disk
architectures - #1 is at a premium). However - our limited experience is that
in Hadoop #1 is almost infinite. #2 can still be a bottleneck though. (it
is possible that this observation holds because of high hadoop/java compute
overheads - regardless - this seems to be the reality).
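for concreteness - a tiny micro-benchmark (purely illustrative, not hive code)
that separates the two components on a tab-delimited text file. on a second run
the OS cache makes #1 look nearly free, while the parsing cost in #2 stays put:

    import java.io.*;
    import java.nio.charset.StandardCharsets;

    // hypothetical micro-benchmark - separates raw read bandwidth (#1)
    // from read-plus-parse cost (#2) on any tab-delimited text file
    public class ScanCostSketch {
        public static void main(String[] args) throws IOException {
            File f = new File(args[0]);
            byte[] buf = new byte[64 * 1024];

            // component #1: just pull bytes off the file system
            long t0 = System.nanoTime();
            long bytes = 0;
            try (InputStream in = new BufferedInputStream(new FileInputStream(f))) {
                int n;
                while ((n = in.read(buf)) != -1) bytes += n;
            }
            double readSec = (System.nanoTime() - t0) / 1e9;

            // component #2: read the same bytes and split every row into fields
            t0 = System.nanoTime();
            long fields = 0;
            try (BufferedReader r = new BufferedReader(new InputStreamReader(
                    new FileInputStream(f), StandardCharsets.UTF_8))) {
                String line;
                while ((line = r.readLine()) != null) fields += line.split("\t", -1).length;
            }
            double scanSec = (System.nanoTime() - t0) / 1e9;

            System.out.printf("read only : %.1f MB/s%n", bytes / 1e6 / readSec);
            System.out.printf("read+parse: %.1f MB/s (%d fields)%n",
                    bytes / 1e6 / scanSec, fields);
        }
    }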
Given this - i like the idea of a scheme where columns are stored as
independent streams inside a block-oriented file format (each file block
contains a set of rows - but the organization inside each block is by column).
This does not optimize for #1 - but does optimize for #2 (potentially in
conjunction with Hive's interfaces to get one column at a time from the IO
libraries). It also gives us nearly equivalent compression.
(The alternative scheme of having a different file per column is also
complicated by the fact that locality is almost impossible to ensure - and
there is no reasonable way of asking hdfs to colocate different file segments,
nor is one likely in the near future.)
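to make the preferred layout concrete - a minimal sketch of a writer for the
columnar-within-a-block idea (every class and constant name here is made up,
this is not hive or hadoop code): buffer a set of rows, then write each column
of the block as its own length-prefixed stream so a reader can seek past the
columns it doesn't need:

    import java.io.*;
    import java.util.*;

    // hypothetical writer: each block holds ROWS_PER_BLOCK rows, but inside
    // the block every column is one contiguous, independently skippable stream
    public class ColumnInBlockWriter implements Closeable {
        private static final int ROWS_PER_BLOCK = 10000;
        private final DataOutputStream out;
        private final int numColumns;
        private final List<List<byte[]>> buffered; // buffered.get(c) = values of column c

        public ColumnInBlockWriter(OutputStream raw, int numColumns) {
            this.out = new DataOutputStream(raw);
            this.numColumns = numColumns;
            this.buffered = new ArrayList<List<byte[]>>();
            for (int c = 0; c < numColumns; c++) buffered.add(new ArrayList<byte[]>());
        }

        public void append(byte[][] row) throws IOException {
            for (int c = 0; c < numColumns; c++) buffered.get(c).add(row[c]);
            if (buffered.get(0).size() >= ROWS_PER_BLOCK) flushBlock();
        }

        private void flushBlock() throws IOException {
            int rows = buffered.get(0).size();
            if (rows == 0) return;
            out.writeInt(rows);
            for (List<byte[]> col : buffered) {
                long len = 0;
                for (byte[] v : col) len += 4 + v.length;
                out.writeLong(len); // lets a reader skip this whole column stream
                for (byte[] v : col) { out.writeInt(v.length); out.write(v); }
                col.clear();
            }
        }

        @Override public void close() throws IOException { flushBlock(); out.close(); }
    }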
--
i would love to understand how you are planning to approach this. will we
still use sequencefile as a container - or should we ditch it? (it wasn't a
great fit for hive - given that we don't use the key field - but it was the
best thing we could find). We have seen that keeping a number of codecs open
can hurt memory usage - so one open question for me is whether we can actually
afford to open N concurrent compressed streams (assuming each column is stored
compressed separately).
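to put the codec question in concrete terms - compressing each column
separately means one live codec per column, and each open stream pins its
codec's native state plus a java-side buffer until the block is flushed (a toy
sketch with made-up numbers, using plain java.util.zip rather than hadoop's
codec classes):

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    // toy sketch - N concurrently open compressed column streams
    public class PerColumnCodecSketch {
        public static void main(String[] args) throws IOException {
            int numColumns = 50; // made-up table width
            Deflater[] codecs = new Deflater[numColumns];
            DeflaterOutputStream[] streams = new DeflaterOutputStream[numColumns];
            for (int c = 0; c < numColumns; c++) {
                codecs[c] = new Deflater();
                // every open stream pins its codec's native zlib window plus
                // this 64 KB buffer for the life of the block being written
                streams[c] = new DeflaterOutputStream(
                        new ByteArrayOutputStream(), codecs[c], 64 * 1024);
            }
            // ... column values would be appended across all N streams here ...
            for (int c = 0; c < numColumns; c++) {
                streams[c].close(); // finishes the compressed stream
                codecs[c].end();    // releases the native zlib memory
            }
        }
    }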
It also seems that one could define a ColumnarInputFormat/OutputFormat as a
generic api with different implementations and pluggable containers
underneath - supporting either the file-per-column or the columnar-within-a-block
approach. in that sense we could build something more generic for hadoop (and
then just make sure that hive's lazy serde uses the columnar api for data
access - instead of the row-based api exposed by the current inputformat).
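one possible shape for such an api - purely a sketch, every name here is
hypothetical: instead of handing back whole rows the way the current
inputformat does, a reader advances over row groups and serves one projected
column at a time:

    import java.io.IOException;

    // hypothetical interface - not an existing hadoop or hive api. the
    // container underneath (file-per-column, or columnar-within-a-block)
    // would be a pluggable implementation detail.
    public interface ColumnarRecordReader {
        // restrict reads to a projection; columns not listed are never
        // decompressed or deserialized
        void setProjectedColumns(int[] columnIds);

        // advance to the next row group (e.g. the next block of the file)
        boolean nextRowGroup() throws IOException;

        // values of one projected column for the current row group
        byte[][] getColumn(int columnId) throws IOException;

        void close() throws IOException;
    }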
> Make Hive support column based storage
> --------------------------------------
>
> Key: HIVE-352
> URL: https://issues.apache.org/jira/browse/HIVE-352
> Project: Hadoop Hive
> Issue Type: New Feature
> Reporter: he yongqiang
>
> column based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will
> enhance hive to support column based storage.
> Actually, we have already done some work on column based storage on top of
> hdfs; i think it will need some review and refactoring to port it to Hive.
> Any thoughts?