Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16578

I only looked at the PR description; here are my 2 cents. Currently, column pruning is done in 2 steps in Spark:

1) the optimizer generates extra `Project` nodes to prune unnecessary columns as close to the bottom of the plan as possible, to reduce the data size between operators;
2) the planner extracts the required columns and pushes them down to data sources.

The first step is generally useful even if the data source doesn't support column pruning, because it reduces the data size between operators (e.g. at a shuffle). I think the same holds for nested column pruning. We could implement nested pruning with 2 PRs:

1. improve the current column pruning rule (or add a new rule) to prune nested columns as close to the bottom of the plan as possible
2. improve the planner rule to extract the required nested columns and push them down to Parquet
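To make the difference concrete, here is a minimal, language-agnostic sketch (plain Python, not Spark code; the `prune` helper and the example record are illustrative inventions) of what nested pruning buys over top-level pruning: given required field paths like `person.name`, only those leaves are kept, whereas top-level column pruning would have to keep the whole `person` struct.

```python
def prune(record, paths):
    """Prune a nested dict down to the given dotted field paths.

    This mimics, at the data level, what a nested column pruning rule
    would do at the schema level: keep only the required leaf fields.
    """
    out = {}
    for path in paths:
        parts = path.split(".")
        src, dst = record, out
        for i, part in enumerate(parts):
            if part not in src:
                break  # required path missing from this record
            if i == len(parts) - 1:
                dst[part] = src[part]          # copy the required leaf
            else:
                dst = dst.setdefault(part, {}) # descend, creating structs as needed
                src = src[part]
    return out

row = {"id": 1, "person": {"name": "Ann", "age": 42, "addr": {"city": "X"}}}

# Top-level pruning (step 1 today) can only drop `id` or `person` wholesale;
# nested pruning keeps just person.name:
print(prune(row, ["person.name"]))  # {'person': {'name': 'Ann'}}
```

The same path-based representation (a list of dotted field names) is roughly what the planner rule in the second PR would need to hand to the Parquet reader, so it only materializes the requested leaves.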