[ https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160687#comment-13160687 ]
alex gemini commented on HIVE-2097:
-----------------------------------

Selectivity plays an important role in columnar databases because they use run-length encoding to compress most dimension-attribute columns. For example, take a log table: create table (gender, age, region, message). The selectivity order is gender = 1/2 > age = 1/20 > region = 1/300. We can order the table columns as #1 (gender, age, region, message) or #2 (region, age, gender, message), with the rows sorted by those columns in that order.

For #1, we need only (2 + 2*20 + 2*20*300 + num_of_message) run-length entries to store all the records in one DFS block, but if we organize the table like #2, we need (300 + 300*20 + 300*20*2 + num_of_message). Ignoring num_of_message, that is 12042 versus 18300, so #1 takes only about 66% of the space #2 requires. The only difference is that run-length encoding uses space more efficiently when the table is organized by selectivity.

> Explore mechanisms for better compression with RC Files
> -------------------------------------------------------
>
>                 Key: HIVE-2097
>                 URL: https://issues.apache.org/jira/browse/HIVE-2097
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>
> Optimization of the compression mechanisms used by RC File to be explored. Some initial ideas
> 1. More efficient serialization/deserialization based on type-specific and storage-specific knowledge.
>    For instance, storing sorted numeric values efficiently using some delta coding techniques
> 2. More efficient compression based on type-specific and storage-specific knowledge
>    Enable compression codecs to be specified based on types or individual columns
> 3. Reordering the on-disk storage for better compression efficiency.
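
The run-length arithmetic in the comment above can be checked with a small sketch. It assumes that every (gender, age, region) combination occurs exactly once in the block and that rows are sorted by the columns in the given order. The helper names rle_runs and total_runs are hypothetical and are not part of Hive or RCFile.

```python
from itertools import product


def rle_runs(column):
    """Count the runs of equal consecutive values in one column."""
    runs = 0
    previous = object()  # sentinel that never equals a real value
    for value in column:
        if value != previous:
            runs += 1
            previous = value
    return runs


def total_runs(cardinalities):
    """Sum the per-column run counts for a table sorted by its column order.

    Assumption: every combination of column values appears exactly once.
    product() yields rows in lexicographic order, which matches sorting
    by the first column, then the second, and so on.
    """
    rows = list(product(*(range(c) for c in cardinalities)))
    return sum(
        rle_runs([row[i] for row in rows]) for i in range(len(cardinalities))
    )


order_1 = total_runs([2, 20, 300])   # gender, age, region
order_2 = total_runs([300, 20, 2])   # region, age, gender
ratio = order_1 / order_2
print(order_1, order_2, round(ratio, 2))
```

Under these assumptions both orderings store the same rows, and only the number of run-length entries changes with the column order. Real data with skewed or missing combinations would give different counts.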