[ https://issues.apache.org/jira/browse/SPARK-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved SPARK-1521.
------------------------------
    Resolution: Won't Fix

I assume this is obsolete, or else already implemented in some sense by Tungsten.

> Take character set size into account when compressing in-memory string columns
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-1521
>                 URL: https://issues.apache.org/jira/browse/SPARK-1521
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Cheng Lian
>              Labels: compression
>
> Quoted from [a blog post|https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/] from Facebook:
> bq. Strings dominate the largest tables in our warehouse and make up about 
> 80% of the columns across the warehouse, so optimizing compression for string 
> columns was important. By using a threshold on observed number of distinct 
> column values per stripe, we modified the ORCFile writer to apply dictionary 
> encoding to a stripe only when beneficial. Additionally, we sample the column 
> values and take the character set of the column into account, since a small 
> character set can be leveraged by codecs like Zlib for good compression and 
> dictionary encoding then becomes unnecessary or sometimes even detrimental if 
> applied.
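
For illustration, a minimal Scala sketch of the heuristic described in that quote: sample the column, then dictionary-encode only when the distinct-value count is low and the observed character set is not already small enough for a codec like Zlib to compress well on its own. The object name and thresholds are hypothetical, not Spark's actual in-memory columnar compression API.

{code:scala}
// Minimal sketch of the heuristic above. Names and thresholds are
// hypothetical; this is not Spark's actual columnar compression API.
object DictionaryEncodingHeuristic {

  /**
   * Decide whether dictionary encoding is worthwhile for a sampled string
   * column. Few distinct values favor a dictionary; a small character set
   * means a general-purpose codec like Zlib already compresses the raw
   * bytes well, so a dictionary is unnecessary or even detrimental.
   */
  def shouldDictionaryEncode(
      sample: Seq[String],
      maxDistinctValues: Int = 1024, // hypothetical threshold
      minCharSetSize: Int = 8        // hypothetical threshold
  ): Boolean = {
    val distinctValues = sample.toSet.size
    val charSetSize = sample.mkString.toSet.size // observed character set
    distinctValues <= maxDistinctValues && charSetSize >= minCharSetSize
  }

  def main(args: Array[String]): Unit = {
    // Low cardinality, reasonably rich character set: dictionary-encode.
    println(shouldDictionaryEncode(Seq("united states", "united kingdom", "germany")))
    // Tiny character set (two symbols): let Zlib handle it instead.
    println(shouldDictionaryEncode(Seq("0101", "1100", "0011")))
  }
}
{code}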


