[
https://issues.apache.org/jira/browse/HIVE-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014272#comment-13014272
]
He Yongqiang commented on HIVE-2065:
------------------------------------
Column-specific compression is very interesting, but it is not directly
related to making RCFile compatible with SequenceFile. We can still do that
without the compatibility work.
Some input that may be useful to you:
we examined column groups, sorting the data internally on one column within
each column group. (We did not try different compression codecs across column
groups.) We tried this on 3-4 tables and saw ~20% storage savings on one
table compared with the previous RCFile. The main problem with this approach
is that it is hard to find the correct/most efficient column group definitions.
For example, if table tbl_1 has 20 columns, the user can define:
col_1,col_2,col_11,col_13:0;col_3,col_4,col_15,col_16:1;
This puts col_1, col_2, col_11, and col_13 into one column group and reorders
that group by sorting on col_1 (0 is the index of the first column in the
group). It puts col_3, col_4, col_15, and col_16 into another column group and
reorders that group by sorting on col_4 (index 1). Finally, all other columns
go into the default column group in their original order.
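
To make the definition format concrete, here is a minimal sketch of how such a
string could be parsed. The class and method names (ColumnGroupSpec, Group,
parse) are hypothetical and are not part of Hive or of the code mentioned in
this comment; the grammar is assumed from the single example above (groups
separated by ';', columns by ',', and a zero-based sort index after ':').

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hive: parses "colA,colB:idx;colC,colD:idx;".
final class ColumnGroupSpec {

    /** One column group: its columns and the in-group index of the sort column. */
    static final class Group {
        final List<String> columns;
        final int sortIndex;

        Group(List<String> columns, int sortIndex) {
            this.columns = columns;
            this.sortIndex = sortIndex;
        }

        /** Name of the column this group is sorted on. */
        String sortColumn() {
            return columns.get(sortIndex);
        }
    }

    /** Parses a definition string; columns not listed belong to the default group. */
    static List<Group> parse(String spec) {
        List<Group> groups = new ArrayList<>();
        for (String part : spec.split(";")) {
            part = part.trim();
            if (part.isEmpty()) {
                continue; // tolerate a trailing ';'
            }
            int colon = part.lastIndexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("missing ':<sortIndex>' in: " + part);
            }
            List<String> columns = new ArrayList<>();
            for (String name : part.substring(0, colon).split(",")) {
                columns.add(name.trim());
            }
            int sortIndex = Integer.parseInt(part.substring(colon + 1).trim());
            if (sortIndex < 0 || sortIndex >= columns.size()) {
                throw new IllegalArgumentException("sort index out of range in: " + part);
            }
            groups.add(new Group(columns, sortIndex));
        }
        return groups;
    }
}
```

Parsing the example definition with this sketch yields two groups whose sort
columns are col_1 and col_4; every column not named in the string would fall
into the default group.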
It should also be easy to allow a different compression codec for each column
group.
The main blocking issue for this approach is the lack of a full set of
utilities for finding the best column group definition.
Instead of doing that in the existing RCFile, do you think it would be better
to explore it in the new one that I just mentioned? If you are interested, we
can share the existing code that we have for the things I mentioned. You could
then work on the compression codec based on the new one and provide a utility
for finding the best column group definition.
What do you think?
> RCFile issues
> -------------
>
> Key: HIVE-2065
> URL: https://issues.apache.org/jira/browse/HIVE-2065
> Project: Hive
> Issue Type: Bug
> Reporter: Krishna Kumar
> Assignee: Krishna Kumar
> Priority: Minor
> Attachments: HIVE.2065.patch.0.txt, Slide1.png, proposal.png
>
>
> Some potential issues with RCFile
> 1. Remove unwanted synchronized modifiers on the methods of RCFile. As per
> yongqiang he, the class is not meant to be thread-safe (and it is not). Might
> as well get rid of the confusing and performance-impacting lock acquisitions.
> 2. Record Length overstated for compressed files. IIUC, the key compression
> happens after we have written the record length.
> {code}
> int keyLength = key.getSize();
> if (keyLength < 0) {
>   throw new IOException("negative length keys not allowed: " + key);
> }
> out.writeInt(keyLength + valueLength); // total record length
> out.writeInt(keyLength); // key portion length
> if (!isCompressed()) {
>   out.writeInt(keyLength);
>   key.write(out); // key
> } else {
>   keyCompressionBuffer.reset();
>   keyDeflateFilter.resetState();
>   key.write(keyDeflateOut);
>   keyDeflateOut.flush();
>   keyDeflateFilter.finish();
>   int compressedKeyLen = keyCompressionBuffer.getLength();
>   out.writeInt(compressedKeyLen);
>   out.write(keyCompressionBuffer.getData(), 0, compressedKeyLen);
> }
> {code}
> 3. For sequence file compatibility, the compressed key length should be the
> next field to record length, not the uncompressed key length.
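
The length mismatch described in items 2 and 3 of the quoted issue goes away
if the key is compressed before any length is written. The following is a
minimal sketch of that ordering only, not the attached patch and not RCFile's
actual on-disk layout: RecordHeaderSketch and its methods are hypothetical,
java.util.zip stands in for the Hadoop codec, and the uncompressed key length
is simply omitted here.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.DeflaterOutputStream;

// Hypothetical illustration; simplified layout: recordLen, keyLen, key, value.
final class RecordHeaderSketch {

    /** Deflates the raw bytes; stands in for the codec used by the real writer. */
    static byte[] compress(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflate = new DeflaterOutputStream(buffer)) {
            deflate.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for in-memory streams
        }
        return buffer.toByteArray();
    }

    /** Writes one record, deriving both length fields from the bytes actually stored. */
    static byte[] writeRecord(byte[] key, byte[] value, boolean compressed) {
        // Compress first, so the stored key size is known before the header is written.
        byte[] keyOnDisk = compressed ? compress(key) : key;
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(keyOnDisk.length + value.length); // record length as stored
            out.writeInt(keyOnDisk.length);                // key length as stored
            out.write(keyOnDisk);
            out.write(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

With this ordering the record length always equals the number of bytes that
follow the two length fields, whether or not the key is compressed.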