Cheng Lian created SPARK-1521:
---------------------------------

             Summary: Take character set size into account when compressing in-memory string columns
                 Key: SPARK-1521
                 URL: https://issues.apache.org/jira/browse/SPARK-1521
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 1.1.0
            Reporter: Cheng Lian


Quoted from [a blog post|https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/] from Facebook:

bq. Strings dominate the largest tables in our warehouse and make up about 80% 
of the columns across the warehouse, so optimizing compression for string 
columns was important. By using a threshold on observed number of distinct 
column values per stripe, we modified the ORCFile writer to apply dictionary 
encoding to a stripe only when beneficial. Additionally, we sample the column 
values and take the character set of the column into account, since a small 
character set can be leveraged by codecs like Zlib for good compression and 
dictionary encoding then becomes unnecessary or sometimes even detrimental if 
applied.
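The heuristic described above could be sketched roughly as follows. This is an illustrative sketch only, not Spark's or ORCFile's actual implementation; the object name, thresholds, and sampling strategy are all assumptions chosen for clarity:

```scala
// Hypothetical sketch of the encoding-selection heuristic described in the
// quoted post: apply dictionary encoding only when a sample of the column
// shows few distinct values, and skip it when the character set is already
// small enough for a Zlib-style codec to exploit on its own.
// All names and threshold values here are illustrative assumptions.
object StringCompressionHeuristic {
  // Above this distinct-value ratio, a dictionary is unlikely to pay off.
  val DistinctRatioThreshold = 0.1
  // At or below this many distinct characters, a generic codec alone
  // already compresses well, so a dictionary adds little or hurts.
  val SmallCharSetSize = 16

  def useDictionaryEncoding(sample: Seq[String]): Boolean = {
    if (sample.isEmpty) return false
    // Fraction of sampled values that are distinct.
    val distinctRatio = sample.distinct.size.toDouble / sample.size
    // Size of the character set observed across the sampled values.
    val charSetSize = sample.flatMap(_.toSet).toSet.size
    distinctRatio <= DistinctRatioThreshold && charSetSize > SmallCharSetSize
  }
}
```

For example, a column of repeated long strings drawn from a wide character set would qualify for dictionary encoding, while a high-cardinality column (e.g. unique IDs) would not.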



--
This message was sent by Atlassian JIRA
(v6.2#6252)
