[
https://issues.apache.org/jira/browse/SPARK-1521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sean Owen resolved SPARK-1521.
------------------------------
Resolution: Won't Fix
I assume this is obsolete, or else already implemented in some sense by Tungsten.
> Take character set size into account when compressing in-memory string columns
> ------------------------------------------------------------------------------
>
> Key: SPARK-1521
> URL: https://issues.apache.org/jira/browse/SPARK-1521
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 1.1.0
> Reporter: Cheng Lian
> Labels: compression
>
> Quoted from [a blog
> post|https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/]
> from Facebook:
> bq. Strings dominate the largest tables in our warehouse and make up about
> 80% of the columns across the warehouse, so optimizing compression for string
> columns was important. By using a threshold on observed number of distinct
> column values per stripe, we modified the ORCFile writer to apply dictionary
> encoding to a stripe only when beneficial. Additionally, we sample the column
> values and take the character set of the column into account, since a small
> character set can be leveraged by codecs like Zlib for good compression and
> dictionary encoding then becomes unnecessary or sometimes even detrimental if
> applied.
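> The heuristic described above can be sketched roughly as follows. This is an illustrative Python sketch, not the actual ORCFile writer logic or anything from Spark's columnar compression code; the function name, thresholds, and return labels are all hypothetical:

```python
def choose_string_encoding(values, distinct_threshold=0.5, charset_threshold=16):
    """Hypothetical sketch of the encoding choice described in the quoted
    blog post: dictionary-encode a stripe only when the observed number of
    distinct values is low, and skip dictionary encoding when the sampled
    character set is small, since a general-purpose codec like zlib already
    compresses a small alphabet well. Thresholds here are made up."""
    # Ratio of distinct values across the (sampled) column values.
    distinct_ratio = len(set(values)) / len(values)
    # Observed character set of the sampled values.
    charset = set("".join(values))
    if len(charset) <= charset_threshold:
        # Small alphabet: zlib alone compresses well; dictionary encoding
        # is unnecessary or even detrimental.
        return "plain+zlib"
    if distinct_ratio <= distinct_threshold:
        # Few distinct values: dictionary encoding pays off.
        return "dictionary"
    # Many distinct values over a large alphabet: fall back to plain
    # encoding plus a general-purpose codec.
    return "plain+zlib"
```

> For example, a column of DNA fragments (alphabet of four characters) would go straight to zlib regardless of cardinality, while a low-cardinality column over a large alphabet would be dictionary-encoded.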
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]