[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16910655#comment-16910655 ]
Nicholas Chammas edited comment on SPARK-4502 at 8/19/19 7:55 PM:
------------------------------------------------------------------

Thanks for your notes [~Bartalos]. Just FYI, nested schema pruning is set to be enabled by default as part of SPARK-27644.

-With regards to aggregates breaking pruning, have you reported that somewhere? If not, I recommend reporting it and linking to the new issue from here.- Looks like the problem with aggregates breaking schema pruning is already being tracked in SPARK-27217.

was (Author: nchammas):
Thanks for your notes [~Bartalos]. Just FYI, nested schema pruning is set to be enabled by default as part of SPARK-27644. With regards to aggregates breaking pruning, have you reported that somewhere? If not, I recommend reporting it and linking to the new issue from here.

> Spark SQL reads unnecessary nested fields from Parquet
> ------------------------------------------------------
>
>                 Key: SPARK-4502
>                 URL: https://issues.apache.org/jira/browse/SPARK-4502
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 1.1.0
>            Reporter: Liwen Sun
>            Assignee: Michael Allman
>            Priority: Critical
>             Fix For: 2.4.0
>
>
> When reading a field of a nested column from Parquet, Spark SQL reads and
> assembles all the fields of that nested column. This is unnecessary, as
> Parquet supports fine-grained field reads out of a nested column. This may
> degrade performance significantly when a nested column has many fields.
> For example, I loaded JSON tweets data into Spark SQL and ran the following
> query:
>
> {{SELECT User.contributors_enabled FROM Tweets;}}
>
> User is a nested structure that has 38 primitive fields (for the Tweets
> schema, see: https://dev.twitter.com/overview/api/tweets). Here is the log
> message:
>
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 cell/ms}}
>
> For comparison, I also ran:
>
> {{SELECT User FROM Tweets;}}
>
> And here is the log message:
>
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
>
> So both queries load 38 columns from Parquet, while the first query only
> needs 1 column. I also measured the bytes read within Parquet. In both
> cases, the same number of bytes (99365194 bytes) were read.

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
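A note for readers landing here from search: on Spark versions where nested schema pruning exists but is not yet on by default (i.e. before the SPARK-27644 change), it can be enabled per session with the configuration key below. This is a config fragment, not part of the original report:

```
-- Enable nested schema pruning for the current session (Spark 2.4.x;
-- becomes the default via SPARK-27644).
SET spark.sql.optimizer.nestedSchemaPruning.enabled=true;
```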
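As a quick sanity check on the report's numbers (plain Python arithmetic, no Spark involved), the rec/ms and cell/ms figures in the two log lines follow directly from the record count, the 38 columns, and the timings:

```python
# Verify the throughput figures quoted in the two log messages above.
records = 385779
columns = 38

pruned_ms = 3976   # SELECT User.contributors_enabled FROM Tweets
full_ms = 9461     # SELECT User FROM Tweets

pruned_rec_per_ms = records / pruned_ms
full_rec_per_ms = records / full_ms

print(round(pruned_rec_per_ms, 5))            # ~97.02691 rec/ms, as logged
print(round(pruned_rec_per_ms * columns, 4))  # ~3687.0227 cell/ms, as logged
print(round(full_rec_per_ms, 5))              # ~40.77571 rec/ms, as logged
```

Both log lines report reading all 38 columns, which is exactly the problem: the pruned query needs only 1 of them, yet the same 99365194 bytes are read either way.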