[ 
https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157705#comment-13157705
 ] 

alex gemini commented on HIVE-2097:
-----------------------------------

A few suggestions:
Columnar databases usually order columns so that the most selective column 
comes first, and then store the values within each column in sorted order.
Case 1: we already know the pattern of each column in a big dataset, for 
example because we computed a sample column distribution in a database. What 
we need to know is the number of distinct values in each column. That could be 
given in the create table statement:
create table a
(col1,col2,col3,col4,col5,col6,xxx)
TBLPROPERTIES 
(col1_sample=0.001,col2_sample=0.01,col3_sample=0.5,col4_sample=0.02,col5_sample=0.002,col6_sample=0.005)
When we organize the column groups, we then know which columns are the most 
selective. In this example the selectivity order of table a is 
col3>col4>col2>col6>col5>col1, so we can organize the column groups like 
(col3,col4,col2),(col6,col5),col1
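
A rough sketch of the ordering step in case 1, in Python rather than Hive 
code. The function names and the group sizes (3,2,1) are my own assumptions 
for illustration; nothing above says how the group boundaries should be 
chosen, only what the resulting groups look like.

```python
from typing import Dict, List, Sequence, Tuple


def order_by_selectivity(samples: Dict[str, float]) -> List[str]:
    """Return column names sorted from highest to lowest sample value."""
    return sorted(samples, key=lambda col: samples[col], reverse=True)


def group_columns(ordered: Sequence[str], sizes: Sequence[int]) -> List[Tuple[str, ...]]:
    """Cut the ordered column list into consecutive groups.

    The sizes are a hypothetical input; any leftover columns form a final group.
    """
    groups = []
    start = 0
    for size in sizes:
        groups.append(tuple(ordered[start:start + size]))
        start += size
    if start < len(ordered):
        groups.append(tuple(ordered[start:]))
    return groups


# Sample values taken from the table properties in the example above.
samples = {
    "col1": 0.001, "col2": 0.01, "col3": 0.5,
    "col4": 0.02, "col5": 0.002, "col6": 0.005,
}

ordered = order_by_selectivity(samples)
groups = group_columns(ordered, sizes=(3, 2, 1))
print(ordered)
print(groups)
```

With the sample values from the example this reproduces the order and the 
grouping given above.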
Case 2: we don't know the table properties when we create the table. We can 
just store the data as usual, then provide a utility like hive --service 
rcfile_reorder 'some_hive_table_here'. When this command is executed, it 
submits several mapreduce jobs to calculate the selectivity of each column and 
stores the results in the metastore. It then decompresses each rcfile and 
reorganizes it into more space-efficient column groups.
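
To make the statistics step in case 2 concrete, here is a small single-process 
Python sketch. It stands in for the mapreduce jobs and is not Hive code. I am 
assuming that "selectivity" means distinct values divided by row count; the 
function name and the sample rows are made up for illustration.

```python
from typing import Dict, Hashable, Iterable, Mapping, Sequence


def column_selectivity(
    rows: Iterable[Mapping[str, Hashable]], columns: Sequence[str]
) -> Dict[str, float]:
    """Compute distinct-value count / row count for each column.

    Assumption: this ratio is what is meant by the selectivity of a column.
    """
    distinct = {col: set() for col in columns}
    total = 0
    for row in rows:
        total += 1
        for col in columns:
            distinct[col].add(row[col])
    if total == 0:
        # No rows: report zero rather than dividing by zero.
        return {col: 0.0 for col in columns}
    return {col: len(distinct[col]) / total for col in columns}


# Hypothetical rows; a real job would read them from the rcfiles.
rows = [
    {"col1": "a", "col2": 1},
    {"col1": "a", "col2": 2},
    {"col1": "a", "col2": 3},
    {"col1": "b", "col2": 4},
]

stats = column_selectivity(rows, ["col1", "col2"])
print(stats)
```

On a real table the distinct sets would not fit in memory for large columns, 
so the jobs would probably need sampling or an approximate distinct count.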
Hope this helps.
                
> Explore mechanisms for better compression with RC Files
> -------------------------------------------------------
>
>                 Key: HIVE-2097
>                 URL: https://issues.apache.org/jira/browse/HIVE-2097
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>
> Optimization of the compression mechanisms used by RC File to be explored.
> Some initial ideas
>  
> 1. More efficient serialization/deserialization based on type-specific and 
> storage-specific knowledge.
>  
>    For instance, storing sorted numeric values efficiently using some delta 
> coding techniques
> 2. More efficient compression based on type-specific and storage-specific 
> knowledge
>    Enable compression codecs to be specified based on types or individual 
> columns
> 3. Reordering the on-disk storage for better compression efficiency.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
