Github user vkhristenko commented on the issue:

    https://github.com/apache/spark/pull/16578
  
    Hi,
    
    My name is Viktor and I'm working at CERN on a ROOT I/O DataSource for 
the JVM and an interface to Spark. ROOT I/O is the format used to store 
CERN's LHC data. The ROOT data format is columnar, similar to Parquet, and 
benefits from nested-field pruning as well. I'm new to contributing to 
Apache Spark, which is why I'm spelling this all out explicitly.
    
     - I found that this PR is more general than just Parquet!
     - When my source is used with this PR, the buildReader function, 
https://github.com/diana-hep/spark-root/blob/master/src/main/scala/org/dianahep/sparkroot/experimental/package.scala#L86
 , receives only the schema required by the df.select(...) statement.
     - One minor change is needed, though: "parquetFormat: 
ParquetFileFormat" should be replaced by "fileFormat: FileFormat", since 
there is no dependency on the concrete ParquetFileFormat class defined in 
the parquet package:
    
https://github.com/apache/spark/pull/16578/files?diff=unified#diff-3bad814b3336a83f360d7395bd740759R38
    
     - It may also be worth renaming ParquetSchemaPruning and moving it out 
of the parquet package, since it is quite a bit more general than Parquet; 
otherwise I would have to add a special Rule of my own here: 
https://github.com/apache/spark/pull/16578/files?diff=unified#diff-2370d8ed85930c93ef8e5ce67abca53fR35
    Is that the intended approach?
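
    To make the second bullet concrete, here is a minimal, self-contained 
sketch of nested-field pruning. This is not Spark's actual StructType API — 
the types and the prune helper are toy stand-ins for illustration only:

```scala
// Illustrative sketch (toy types, not Spark's classes): how nested-field
// pruning narrows the schema that a data source's reader is asked to produce.
sealed trait DataType
case object IntType extends DataType
case object StringType extends DataType
case class StructType(fields: List[(String, DataType)]) extends DataType

// Keep only the fields named by `paths` (each path is a list of field
// names, e.g. List("event", "id")), recursing into nested structs.
def prune(schema: StructType, paths: List[List[String]]): StructType =
  StructType(schema.fields.flatMap { case (name, dt) =>
    val here = paths.filter(_.headOption.contains(name))
    if (here.isEmpty) None
    else dt match {
      case s: StructType =>
        val rest = here.map(_.tail).filter(_.nonEmpty)
        if (rest.isEmpty) Some(name -> (s: DataType)) // whole struct requested
        else Some(name -> (prune(s, rest): DataType)) // only some subfields
      case leaf => Some(name -> leaf)
    }
  })

val full = StructType(List(
  "run"   -> IntType,
  "event" -> StructType(List("id" -> IntType, "tag" -> StringType))))

// A query like df.select("event.id") would leave the reader with just
// event.id out of the full schema:
val pruned = prune(full, List(List("event", "id")))
// pruned == StructType(List("event" -> StructType(List("id" -> IntType))))
```

    With the rule in this PR active, a columnar source (Parquet or ROOT) 
then only has to materialize the pruned columns.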
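
    And to make the third bullet concrete, the suggested change is only a 
widening of the declared parameter type. The class and field names below 
paraphrase the PR diff rather than quoting the exact Spark source:

```scala
// Sketch of the suggested generalization (names paraphrased from the PR).
trait FileFormat { def shortName: String }
class ParquetFileFormat extends FileFormat { def shortName = "parquet" }
class RootFileFormat extends FileFormat { def shortName = "root" }

// Before: the pruning rule could only be constructed for Parquet.
//   case class ParquetSchemaPruning(parquetFormat: ParquetFileFormat)

// After: any FileFormat implementation (e.g. a ROOT I/O source)
// can benefit from the same pruning rule.
case class SchemaPruning(fileFormat: FileFormat)

val rule = SchemaPruning(new RootFileFormat)
// rule.fileFormat.shortName == "root"
```

    Nothing in the rule's body appears to need Parquet specifics, which is 
why the trait-typed parameter seems sufficient.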
    
    Thanks!
    
    VK

