[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683348#action_12683348 ]
Joydeep Sen Sarma commented on HIVE-352:
----------------------------------------

> B2.2 is easier to implement, because we don't have the problem of splitting
> different columns of the same block into multiple mappers.

For B2.1, we may be able to control when SequenceFile writes out sync markers (or at least we should investigate whether that is easy enough to do by extending SequenceFile). The advantage of being able to avoid reading specific columns seems pretty significant.

OTOH, one can also easily imagine that SequenceFile does not copy data into a BytesWritable. Instead, we would have a special Writable structure such that, when read is invoked on it, it just copies the reference to the underlying byte buffer. That way there are no copies of data in the SequenceFile reader, and the application (in this case the columnar format reader) is able to skip to the relevant sections of data without touching the irrelevant columns. If we do it this way, B2.2 has no performance downside.

Regarding the compression-related questions raised by Yongqiang: it seems to me that trying out the most generic compression algorithm (gzip) first is better. Trying to specify or infer the best compression technique per column is much harder and is something that can be done later.

One thing we could do to mitigate the number of open codecs is to simply accumulate all the data uncompressed in a buffer per column and then do the compression in one shot at the end (once we think enough data is accumulated) using just one codec object. This obviously seems non-optimal from the point of view of having to scan the data multiple times. OTOH, there were known issues with older versions of Hadoop with lots of open codecs.

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>
> Column based storage has been proven a better storage layout for OLAP.
> Hive does a great job on raw row oriented storage. In this issue, we will
> enhance Hive to support column based storage.
> Actually we have done some work on column based storage on top of HDFS; I
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
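
For illustration only, here is a rough Python sketch of the "accumulate uncompressed per column, then compress in one shot" idea from the comment above. Everything in it is an assumption made for the sketch: the `ColumnBufferWriter` class, the `flush_threshold` parameter, the length-prefixed value encoding, and `read_column` are hypothetical names, not Hive or Hadoop APIs. Python's `zlib` (DEFLATE, the algorithm gzip uses) stands in for a Hadoop codec, and because a `zlib` compressor object cannot be reset, the sketch keeps only one short-lived codec alive at a time rather than literally reusing a single object.

```python
import zlib
from collections import defaultdict


class ColumnBufferWriter:
    """Hypothetical helper: buffers raw bytes per column, compresses at flush."""

    def __init__(self, flush_threshold=64 * 1024):
        # flush_threshold is an assumed tuning knob: total buffered bytes
        # after which the buffered columns are compressed into a block.
        self.flush_threshold = flush_threshold
        self.buffers = defaultdict(bytearray)  # column index -> raw bytes
        self.buffered_bytes = 0
        self.blocks = []  # each block maps column index -> compressed bytes

    def append_row(self, row):
        """Append one row, given as a list of bytes values (one per column)."""
        for col, value in enumerate(row):
            # Length-prefix each value so it can be recovered after decompression.
            self.buffers[col] += len(value).to_bytes(4, "big") + value
            self.buffered_bytes += 4 + len(value)
        if self.buffered_bytes >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Compress every buffered column, one codec at a time."""
        if not self.buffers:
            return
        block = {}
        for col in sorted(self.buffers):
            # Only one compressor exists at any moment, no matter how many
            # columns the table has.
            codec = zlib.compressobj()
            block[col] = codec.compress(bytes(self.buffers[col])) + codec.flush()
        self.blocks.append(block)
        self.buffers.clear()
        self.buffered_bytes = 0


def read_column(block, col):
    """Decompress a single column of a block without touching the others."""
    raw = zlib.decompress(block[col])
    values, pos = [], 0
    while pos < len(raw):
        n = int.from_bytes(raw[pos:pos + 4], "big")
        pos += 4
        values.append(raw[pos:pos + n])
        pos += n
    return values
```

The trade-off matches the one described in the comment: the data is scanned twice (once into the buffer, once through the compressor), but the number of live codecs no longer grows with the number of columns, and a reader can decompress only the columns it needs.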