Github user cloud-fan commented on the issue: https://github.com/apache/spark/pull/16578

I only looked at the PR description; here are my 2 cents. Currently, column pruning is done in 2 steps in Spark:

1) the optimizer generates extra `Project` nodes to prune unnecessary columns as close to the bottom of the plan as possible, to reduce the data size between operators;
2) the planner extracts the required columns and pushes them down to data sources.

The first step is generally useful even if the data source doesn't support column pruning, because it reduces the data size between operators (e.g. at a shuffle). I think the same holds for nested column pruning. We could implement nested pruning with 2 PRs:

1. improve the current column pruning rule (or add a new rule) to prune nested columns as close to the bottom of the plan as possible
2. improve the planner rule to extract the required nested columns and push them down to Parquet
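To make the difference concrete, here is a minimal, language-agnostic sketch (plain Python, not Spark code; the `prune` helper and the example record are illustrative inventions) of what nested pruning buys over top-level pruning: given required field paths like `person.name`, only those leaves are kept, whereas top-level column pruning would have to keep the whole `person` struct.

```python
def prune(record, paths):
    """Prune a nested dict down to the given dotted field paths.

    This mimics, at the data level, what a nested column pruning rule
    would do at the schema level: keep only the required leaf fields.
    """
    out = {}
    for path in paths:
        parts = path.split(".")
        src, dst = record, out
        for i, part in enumerate(parts):
            if part not in src:
                break  # required path missing from this record
            if i == len(parts) - 1:
                dst[part] = src[part]          # copy the required leaf
            else:
                dst = dst.setdefault(part, {}) # descend, creating structs as needed
                src = src[part]
    return out

row = {"id": 1, "person": {"name": "Ann", "age": 42, "addr": {"city": "X"}}}

# Top-level pruning (step 1 today) can only drop `id` or `person` wholesale;
# nested pruning keeps just person.name:
print(prune(row, ["person.name"]))  # {'person': {'name': 'Ann'}}
```

The same path-based representation (a list of dotted field names) is roughly what the planner rule in the second PR would need to hand to the Parquet reader, so it only materializes the requested leaves.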