[ 
https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683348#action_12683348
 ] 

Joydeep Sen Sarma commented on HIVE-352:
----------------------------------------

>B2.2 is easier to implement, because we don't have the problem of splitting 
>different columns of the same block into multiple mappers.

for B2.1 - we may be able to control when SequenceFile writes out sync markers 
(or at least we should investigate whether that's easy enough to do by 
extending SequenceFile). the advantage of not having to read unneeded columns 
seems pretty significant.

OTOH - one can also easily imagine that SequenceFile does not copy data into a 
BytesWritable - rather that we have a special Writable structure such that when 
read is invoked on it - it just copies the reference to the underlying byte 
buffer. that way there are no copies of data in the sequencefile reader, and 
the application (in this case the columnar format reader) is able to skip to 
the relevant sections of data without touching the irrelevant columns. if we do 
it this way - B2.2 has no performance downside.
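
a minimal sketch of the reference-copying idea, in plain Java with no Hadoop 
dependency. everything here is hypothetical and for illustration only: the 
BytesRef holder stands in for the "special Writable", and the block layout 
(column count, then per-column lengths, then payloads) is an assumption, not 
the format proposed in this issue.

```java
import java.nio.ByteBuffer;

public class ColumnSliceSketch {
    /** Hypothetical value holder: keeps a reference to the reader's buffer instead of copying it. */
    public static final class BytesRef {
        public byte[] data;   // shared with the reader, never copied
        public int offset;    // start of the column payload inside data
        public int length;    // payload length in bytes

        public void set(byte[] data, int offset, int length) {
            this.data = data;
            this.offset = offset;
            this.length = length;
        }
    }

    /**
     * Assumed block layout (illustration only): an int column count, then one
     * int length per column, then the column payloads back to back.
     */
    public static byte[] writeBlock(byte[][] columns) {
        int size = 4 + 4 * columns.length;
        for (byte[] c : columns) size += c.length;
        ByteBuffer out = ByteBuffer.allocate(size);
        out.putInt(columns.length);
        for (byte[] c : columns) out.putInt(c.length);
        for (byte[] c : columns) out.put(c);
        return out.array();
    }

    /** Points ref at the wanted column; payloads of other columns are never read or copied. */
    public static void readColumn(byte[] block, int wanted, BytesRef ref) {
        ByteBuffer in = ByteBuffer.wrap(block);
        int numColumns = in.getInt();
        if (wanted < 0 || wanted >= numColumns) {
            throw new IndexOutOfBoundsException("no such column: " + wanted);
        }
        int offset = 4 + 4 * numColumns;              // start of the first payload
        for (int i = 0; i < wanted; i++) {
            offset += in.getInt();                    // only the length header is read
        }
        int length = in.getInt();
        ref.set(block, offset, length);               // reference copy, no data copy
    }
}
```

the caller gets back the same byte[] the reader holds, plus an offset and a 
length, so picking one column out of a block costs a few header reads.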

regarding the compression related questions raised by Yongqiang - it seems to 
me that trying out the most generic compression algorithm (gzip) is better - 
trying to specify or infer the best compression technique per column is much 
harder and something that can be done later. one thing we could do to limit the 
number of open codecs is to simply accumulate all the data uncompressed in a 
buffer per column and then do the compression in one shot at the end (once we 
think enough data has accumulated) using just one codec object. this obviously 
seems non-optimal since the data has to be scanned multiple times - 
OTOH - there were known issues with older versions of hadoop with lots of open 
codecs.
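
a rough sketch of the buffer-then-compress idea, again in plain Java. 
assumptions: java.util.zip.Deflater stands in for a Hadoop codec (it produces 
zlib-wrapped DEFLATE, the same algorithm gzip uses, not gzip framing), and the 
class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ColumnBufferSketch {
    private final ByteArrayOutputStream[] buffers;   // one uncompressed buffer per column
    private final Deflater codec = new Deflater();   // the single shared codec object

    public ColumnBufferSketch(int numColumns) {
        buffers = new ByteArrayOutputStream[numColumns];
        for (int i = 0; i < numColumns; i++) {
            buffers[i] = new ByteArrayOutputStream();
        }
    }

    /** Appends a value to its column buffer; nothing is compressed yet. */
    public void append(int column, byte[] value) {
        buffers[column].write(value, 0, value.length);
    }

    /** Compresses every column buffer in turn with the one codec, then clears the buffers. */
    public byte[][] flush() {
        byte[][] compressed = new byte[buffers.length][];
        byte[] chunk = new byte[4096];
        for (int i = 0; i < buffers.length; i++) {
            codec.reset();                            // reuse instead of opening a new codec
            codec.setInput(buffers[i].toByteArray()); // second pass over the data
            codec.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            while (!codec.finished()) {
                int n = codec.deflate(chunk);
                out.write(chunk, 0, n);
            }
            compressed[i] = out.toByteArray();
            buffers[i].reset();
        }
        return compressed;
    }

    /** Helper for checking the round trip. */
    public static byte[] inflate(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(chunk);
                if (n == 0 && inflater.needsInput()) break;  // truncated input
                out.write(chunk, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

the extra pass over the data shows up in flush(): each column buffer is copied 
out and fed to the codec after it has already been written once by append().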


> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>
> column based storage has been proven a better storage layout for OLAP. 
> Hive does a great job on raw row oriented storage. In this issue, we will 
> enhance hive to support column based storage. 
> Actually we have done some work on column based storage on top of hdfs, i 
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
