Cheng Lian created SPARK-1521:
---------------------------------
Summary: Take character set size into account when compressing
in-memory string columns
Key: SPARK-1521
URL: https://issues.apache.org/jira/browse/SPARK-1521
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 1.1.0
Reporter: Cheng Lian
Quoted from a Facebook engineering [blog
post|https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/]:
bq. Strings dominate the largest tables in our warehouse and make up about 80%
of the columns across the warehouse, so optimizing compression for string
columns was important. By using a threshold on observed number of distinct
column values per stripe, we modified the ORCFile writer to apply dictionary
encoding to a stripe only when beneficial. Additionally, we sample the column
values and take the character set of the column into account, since a small
character set can be leveraged by codecs like Zlib for good compression and
dictionary encoding then becomes unnecessary or sometimes even detrimental if
applied.
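The heuristic described above can be sketched roughly as follows. This is an illustrative sketch only, not Spark's or ORCFile's actual implementation; the function name, sample size, and both thresholds are made-up placeholders:

```python
def choose_string_encoding(values, sample_size=1000,
                           distinct_ratio_threshold=0.5,
                           charset_size_threshold=16):
    """Pick 'dictionary' or 'plain' encoding for a stripe of string values.

    Hypothetical sketch: all thresholds are illustrative, not taken from
    ORCFile or Spark.
    """
    sample = values[:sample_size]
    distinct_ratio = len(set(sample)) / max(len(sample), 1)

    # Observe the character set of the sampled values.
    charset = set()
    for v in sample:
        charset.update(v)

    # A small character set already compresses well with a codec like
    # zlib, so dictionary encoding adds little and may even hurt.
    if len(charset) <= charset_size_threshold:
        return 'plain'

    # Otherwise, dictionary-encode only when repetition is high, i.e.
    # the number of distinct values per stripe is below the threshold.
    return 'dictionary' if distinct_ratio <= distinct_ratio_threshold else 'plain'
```

For example, a country-code column (tiny character set) would stay plain, a highly repetitive column over a large alphabet would be dictionary-encoded, and a near-unique column would stay plain because the dictionary would not pay for itself.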
--
This message was sent by Atlassian JIRA
(v6.2#6252)