[ https://issues.apache.org/jira/browse/SPARK-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiangrui Meng updated SPARK-16073:
----------------------------------
    Description: 
Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet data. However, Parquet also has its own encodings to compress columns/arrays, e.g., dictionary encoding: https://github.com/apache/parquet-format/blob/master/Encodings.md.

It might be worth checking the performance overhead of Parquet encodings on saving large primitive arrays, which is a machine learning use case. If the overhead is significant, we should expose a configuration in Spark to control the encoding levels.

Note that this shouldn't be tested under Spark until SPARK-16043 is fixed.

  was:
Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet data. However, Parquet also has its own encodings to compress columns/arrays, e.g., dictionary encoding: https://github.com/apache/parquet-format/blob/master/Encodings.md.

It might be worth checking the performance overhead of Parquet encodings for saving primitive arrays, which is a machine learning use case.

Note that this shouldn't be tested under Spark until SPARK-16043 is fixed.


> Performance of Parquet encodings on saving primitive arrays
> -----------------------------------------------------------
>
>                 Key: SPARK-16073
>                 URL: https://issues.apache.org/jira/browse/SPARK-16073
>             Project: Spark
>          Issue Type: Task
>          Components: MLlib, SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>
> Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet
> data. However, Parquet also has its own encodings to compress columns/arrays,
> e.g., dictionary encoding:
> https://github.com/apache/parquet-format/blob/master/Encodings.md.
> It might be worth checking the performance overhead of Parquet encodings on
> saving large primitive arrays, which is a machine learning use case. If the
> overhead is significant, we should expose a configuration in Spark to control
> the encoding levels.
> Note that this shouldn't be tested under Spark until SPARK-16043 is fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)