[ https://issues.apache.org/jira/browse/SPARK-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15740807#comment-15740807 ]
Kazuaki Ishizaki commented on SPARK-16073:
------------------------------------------

It is an interesting topic. In the current situation, SPARK-16043 will not be merged soon. This is because the performance issues for DataFrame/Dataset programs with primitive arrays are being addressed by other approaches.
If there are benchmark programs for this measurement, I am happy to run them with SPARK-16043. Are there any benchmark programs?

> Performance of Parquet encodings on saving primitive arrays
> -----------------------------------------------------------
>
>                 Key: SPARK-16073
>                 URL: https://issues.apache.org/jira/browse/SPARK-16073
>             Project: Spark
>          Issue Type: Task
>          Components: MLlib, SQL
>    Affects Versions: 2.0.0
>            Reporter: Xiangrui Meng
>
> Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet
> data. However, Parquet also has its own encodings to compress columns/arrays,
> e.g., dictionary encoding:
> https://github.com/apache/parquet-format/blob/master/Encodings.md.
> It might be worth checking the performance overhead of Parquet encodings on
> saving large primitive arrays, which is a machine learning use case. If the
> overhead is significant, we should expose a configuration in Spark to control
> the encoding levels.
> Note that this shouldn't be tested under Spark until SPARK-16043 was fixed.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
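As a starting point for the benchmark question raised in the comment, here is a minimal sketch in plain Python that compares the cost of a plain encoding with a toy dictionary encoding on a large array of doubles. It does not use Spark or Parquet, and it is not Parquet's actual encoder: the function names (plain_encode, dictionary_encode, dictionary_decode, time_it) and the data sizes are hypothetical, chosen only to illustrate the kind of overhead the issue description wants measured.

```python
import random
import struct
import time
from array import array


def plain_encode(values):
    """Plain encoding: pack every double as 8 little-endian bytes."""
    return struct.pack(f"<{len(values)}d", *values)


def dictionary_encode(values):
    """Toy dictionary encoding: distinct values plus one index per entry.

    Simplification: Python dict keys treat 0.0 and -0.0 as the same value,
    so this sketch does not preserve the sign of zero.
    """
    index_of = {}
    indices = array("I")  # unsigned int indices into the dictionary
    for v in values:
        indices.append(index_of.setdefault(v, len(index_of)))
    # Dict keys iterate in insertion order, matching the assigned indices.
    dictionary = struct.pack(f"<{len(index_of)}d", *index_of)
    return dictionary, indices.tobytes()


def dictionary_decode(dictionary, index_bytes):
    """Inverse of dictionary_encode, used to check the round trip."""
    uniques = struct.unpack(f"<{len(dictionary) // 8}d", dictionary)
    indices = array("I")
    indices.frombytes(index_bytes)
    return [uniques[i] for i in indices]


def time_it(fn, values, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(values)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    rng = random.Random(0)
    n = 1_000_000
    # Assumed workloads: nearly all-distinct values (typical of ML feature
    # arrays) versus a low-cardinality array where a dictionary pays off.
    workloads = {
        "high cardinality": [rng.random() for _ in range(n)],
        "low cardinality": [float(rng.randrange(16)) for _ in range(n)],
    }
    for name, values in workloads.items():
        print(name)
        print("  plain      (s):", time_it(plain_encode, values))
        print("  dictionary (s):", time_it(dictionary_encode, values))
```

The high-cardinality case is the one that matters for the machine learning use case in the description: building a dictionary for values that rarely repeat costs time and saves little space. A real measurement would still have to go through Spark's Parquet writer rather than this stand-in.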