[ https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157705#comment-13157705 ]
alex gemini commented on HIVE-2097:
-----------------------------------

A few suggestions:

Columnar databases usually order columns so that the most selective column comes first, and then store the values within each column in sorted order.

Case 1: We already know the value pattern of each column in a large dataset, for example because we computed a sample column distribution in the database. What we need is the number of distinct values in each column. This could be declared in the CREATE TABLE statement:

create table a (col1,col2,col3,col4,col5,col6,xxx) TBLPROPERTIES (col1_sample=0.001,col2_sample=0.01,col3_sample=0.5,col4_sample=0.02,col5_sample=0.002,col6_sample=0.005)

When we organize the column groups, we then know which columns are the most selective. In this example, the selectivity order of table a is col3 > col4 > col2 > col6 > col5 > col1, so we can organize the column groups as (col3,col4,col2), (col6,col5), col1.

Case 2: We do not know these table properties when we create the table. We can store the data as usual and then provide a utility such as

hive --service rcfile_reorder 'some_hive_table_here'

When this command is executed, it submits several MapReduce jobs that calculate the selectivity of each column and store the results in the metastore. It then decompresses each RCFile and reorganizes it into more space-efficient column groups.

Hope this helps.

> Explore mechanisms for better compression with RC Files
> -------------------------------------------------------
>
>                 Key: HIVE-2097
>                 URL: https://issues.apache.org/jira/browse/HIVE-2097
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>
> Optimization of the compression mechanisms used by RC File to be explored.
> Some initial ideas
>
> 1. More efficient serialization/deserialization based on type-specific and
>    storage-specific knowledge.
>    For instance, storing sorted numeric values efficiently using some delta
>    coding techniques
> 2. More efficient compression based on type-specific and storage-specific
>    knowledge
>    Enable compression codecs to be specified based on types or individual
>    columns
> 3. Reordering the on-disk storage for better compression efficiency.
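
The column ordering and grouping described in case 1 of the comment above, and the per-column statistic that the case 2 jobs would have to compute, can be sketched roughly as follows. This is only an illustration and not part of Hive or RCFile. The function names, the reading of "selectivity" as distinct values divided by row count, and the group sizes (3, 2, 1) are all assumptions chosen to reproduce the example in the comment.

```python
from typing import Dict, List, Sequence


def estimate_selectivity(rows: Sequence[Sequence[object]],
                         columns: Sequence[str]) -> Dict[str, float]:
    """Assumed statistic: distinct values divided by row count, per column."""
    if not rows:
        return {name: 0.0 for name in columns}
    total = len(rows)
    return {
        name: len({row[i] for row in rows}) / total
        for i, name in enumerate(columns)
    }


def order_columns(selectivity: Dict[str, float]) -> List[str]:
    """Sort column names by descending value, breaking ties by name."""
    return sorted(selectivity, key=lambda name: (-selectivity[name], name))


def group_columns(ordered: Sequence[str],
                  group_sizes: Sequence[int]) -> List[List[str]]:
    """Cut the ordered columns into consecutive groups of the given sizes.

    The sizes are a hypothetical tuning input; the comment does not say
    how group boundaries should be chosen.
    """
    groups: List[List[str]] = []
    start = 0
    for size in group_sizes:
        if start >= len(ordered):
            break
        groups.append(list(ordered[start:start + size]))
        start += size
    if start < len(ordered):
        # Any columns not covered by group_sizes go into one final group.
        groups.append(list(ordered[start:]))
    return groups


# Values taken from the TBLPROPERTIES example in the comment.
samples = {
    "col1": 0.001, "col2": 0.01, "col3": 0.5,
    "col4": 0.02, "col5": 0.002, "col6": 0.005,
}
ordered = order_columns(samples)
groups = group_columns(ordered, [3, 2, 1])
print(ordered)
print(groups)
```

With the sample values from the comment, this yields the order col3, col4, col2, col6, col5, col1 and the groups (col3,col4,col2), (col6,col5), (col1). For case 2, `estimate_selectivity` stands in for what the MapReduce jobs would compute over the full table before the reorder step.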