[ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12689041#action_12689041 ]
He Yongqiang commented on HIVE-352:
-----------------------------------

Thanks, Joydeep and Prasad.

First I would like to give an update on the recent work: I had implemented an initial RCFile that was just a wrapper around SequenceFile and relied on HADOOP-5553. Since it seems HADOOP-5553 will not be resolved, I have implemented another RCFile, which copies much of the code from SequenceFile (especially the Writer code) and provides the same on-disk data layout as SequenceFile. Here is a draft description of the new RCFile:

1) Only record compression, or no compression at all. In B2.2 we store a bunch of raw rows in one record in a columnar way, so there is no need for block compression, because block compression would decompress all the data.

2) In-record compression. If the writer is created with the compress flag, then the value part of each record is compressed, but in a column-wise style. The layout looks like this:

Record length
Key length
{begin of the key part}
number_of_rows_in_this_record (vint)
column_1_ondisk_length (vint), column_1_row_1_value_plain_length, column_1_row_2_value_plain_length, ...
column_2_ondisk_length (vint), column_2_row_1_value_plain_length, column_2_row_2_value_plain_length, ...
...
{end of the key part}
{begin of the value part}
compressed or plain data of [column_1_row_1_value, column_1_row_2_value, ...]
compressed or plain data of [column_2_row_1_value, column_2_row_2_value, ...]
{end of the value part}

The key part: KeyBuffer
The value part: ValueBuffer

3) The reader. It currently provides only two APIs:
next(LongWritable rowID): returns the next row id. I think it should be refined, because the row id may not be a real row id; it is only the count of rows already passed since the beginning of the reader.
List<Bytes> getCurrentRow(): returns the raw bytes of all the columns of one row.
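As a rough sketch, serializing the key part described in (2) above could look like the following. This is illustration only, not the actual patch: the class and method names are hypothetical, and the writeVInt here is a simplified little-endian varint rather than Hadoop's real WritableUtils.writeVInt encoding.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class KeyPartSketch {
    // Simplified variable-length int encoding, for illustration only;
    // the real code would use Hadoop's WritableUtils.writeVInt.
    static void writeVInt(DataOutputStream out, int v) throws IOException {
        while ((v & ~0x7F) != 0) {
            out.writeByte((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.writeByte(v);
    }

    // Serialize the key part for one record: numRows, then for each column
    // its on-disk (possibly compressed) byte count followed by the plain
    // length of that column's value in every row of the record.
    static byte[] keyPart(int numRows, int[] onDiskLengths, int[][] plainLengths)
            throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeVInt(out, numRows);                        // number_of_rows_in_this_record
        for (int col = 0; col < onDiskLengths.length; col++) {
            writeVInt(out, onDiskLengths[col]);         // column_i_ondisk_length
            for (int row = 0; row < numRows; row++) {
                writeVInt(out, plainLengths[col][row]); // column_i_row_j_value_plain_length
            }
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Two columns, three rows; every length fits in a single vint byte,
        // so the key part is 1 + (1 + 3) + (1 + 3) = 9 bytes.
        byte[] key = keyPart(3, new int[]{10, 20},
                             new int[][]{{3, 3, 4}, {7, 6, 7}});
        System.out.println(key.length);
    }
}
```

The point of the layout is that all the lengths live in the key part, so a reader that skips a column can jump over that column's entire value bytes using column_i_ondisk_length without decompressing anything.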
Because the reader lets the user specify the column ids that should be skipped, the returned List<Bytes> contains only the bytes of the unskipped columns. Maybe it would be better to store a NullBytes entry in the returned list to represent a skipped column.

> Make Hive support column based storage
> --------------------------------------
>
>          Key: HIVE-352
>          URL: https://issues.apache.org/jira/browse/HIVE-352
>      Project: Hadoop Hive
>   Issue Type: New Feature
>     Reporter: He Yongqiang
>
> Column based storage has been proven a better storage layout for OLAP.
> Hive does a great job on raw row oriented storage. In this issue, we will
> enhance Hive to support column based storage.
> Actually we have done some work on column based storage on top of HDFS; I
> think it will need some review and refactoring to port it to Hive.
> Any thoughts?

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.