[ 
https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158484#comment-13158484
 ] 

Krishna Kumar commented on HIVE-2097:
-------------------------------------

Thanks Alex for the suggestions.

Just to be sure we are on the same page: I believe you are talking about 
approach #3 in the description above, which aligns with the ideas in the 
comment from He Yongqiang. I am currently working on implementing #1 and #2.

Re approach #3: column grouping and row reordering are the general idea, but 
I do not understand your point about column selectivity. Why should selectivity 
play a role here, when any grouping/reordering is done for better compression? 
There are two effects we can exploit for better compression within a column 
group: (a) the values in the two columns are similar, and (b) the values are 
correlated, so that conditional probabilities can be used for better 
compression. In either case, my hope was that we would be able to create 
type-specific compressors for structs, maps, etc. that exploit these 
features, i.e., a struct/map acts as a column group for compression purposes.
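
To make effect (b) concrete, here is a minimal sketch in Python. It is not 
taken from Hive or RCFile code; the column names, the data, and the helper 
entropy_bits are made up for illustration. It compares the entropy cost of 
coding two correlated columns separately with the cost of coding them as one 
group, which is the bound an ideal entropy coder could approach.

```python
import math
from collections import Counter


def entropy_bits(values):
    """Empirical Shannon entropy of a sequence, in bits per value."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical data: the second column is fully determined by the first.
COUNTRY_OF = {"Paris": "FR", "Lyon": "FR", "Berlin": "DE", "Munich": "DE"}
city = ["Paris", "Lyon", "Berlin", "Munich"] * 250
country = [COUNTRY_OF[c] for c in city]

# Cost of coding each column on its own: H(city) + H(country).
separate = entropy_bits(city) + entropy_bits(country)

# Cost of coding the pair as one group:
# H(city, country) = H(city) + H(country | city).
grouped = entropy_bits(list(zip(city, country)))

print(f"separate: {separate:.2f} bits/row, grouped: {grouped:.2f} bits/row")
```

In this toy case the country column costs nothing extra once the city is 
known, so the grouped cost is lower. Real columns are rarely perfectly 
correlated, and a practical codec will not reach the entropy bound, so the 
gain in practice would be smaller.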


                
> Explore mechanisms for better compression with RC Files
> -------------------------------------------------------
>
>                 Key: HIVE-2097
>                 URL: https://issues.apache.org/jira/browse/HIVE-2097
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>
> Optimization of the compression mechanisms used by RC File to be explored.
> Some initial ideas:
>
> 1. More efficient serialization/deserialization based on type-specific and 
>    storage-specific knowledge.
>    For instance, storing sorted numeric values efficiently using delta 
>    coding techniques.
> 2. More efficient compression based on type-specific and storage-specific 
>    knowledge.
>    Enable compression codecs to be specified based on types or individual 
>    columns.
> 3. Reordering the on-disk storage for better compression efficiency.
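
The delta coding mentioned under idea #1 can be sketched as follows. This is 
an illustration only, not the serialization Hive or RCFile uses: it assumes a 
column of non-negative integers sorted in ascending order, and it pairs the 
deltas with a base-128 variable-length integer encoding chosen here for the 
example. All function names are hypothetical.

```python
def delta_encode(sorted_values):
    """Replace each value with its difference from the previous one."""
    deltas = []
    prev = 0
    for v in sorted_values:
        deltas.append(v - prev)
        prev = v
    return deltas


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the values."""
    values = []
    total = 0
    for d in deltas:
        total += d
        values.append(total)
    return values


def varint(n):
    """Base-128 varint for a non-negative integer (7 payload bits per byte)."""
    out = bytearray()
    while True:
        low = n & 0x7F
        n >>= 7
        if n:
            out.append(low | 0x80)  # more bytes follow
        else:
            out.append(low)
            return bytes(out)


def encode_plain(values):
    return b"".join(varint(v) for v in values)


def encode_delta(sorted_values):
    return b"".join(varint(d) for d in delta_encode(sorted_values))


# Hypothetical sorted column, e.g. timestamps three seconds apart.
column = [1_300_000_000 + 3 * i for i in range(1000)]
print(len(encode_plain(column)), "bytes plain vs",
      len(encode_delta(column)), "bytes delta-coded")
```

Because the column is sorted, every delta after the first is small and fits 
in one byte, while each raw value needs five. Unsorted or signed columns 
would need a signed encoding for the deltas (zigzag coding is one common 
option), which this sketch leaves out.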
