[GitHub] spark pull request: [SPARK-6910] [WiP] Reduce number of operations...

liancheng Wed, 01 Jul 2015 00:58:24 -0700

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/7049#discussion_r33655995
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala ---
    @@ -361,13 +355,15 @@ private[sql] class ParquetRelation2(
             rawFooters.map(footer => footer.getFile -> footer).toMap
           }
     
    +      footers = new RelationMemo(getFooters())
    --- End diff --
    
    We've recently upgraded Parquet from 1.6.0rc3 to 1.7.0, which no longer 
requires reading footers on the driver side if a global arbitrative schema is 
already available.
    
    I'm working on refactoring this part as follows:
    
    1. Don't read footers on the driver side to compute Parquet file splits.
    2. When no schema is provided and schema merging is not enabled, read the 
footer of a single Parquet file to figure out the schema: the summary file if 
one is available, otherwise a randomly picked data file.
    3. When no schema is provided and schema merging is enabled, read the 
footers of all data files with a Spark job to figure out the schema.
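    
    The decision logic in points 2 and 3 could be sketched roughly as below. 
This is only an illustration, not the code in this PR: `Schema`, `ParquetDir`, 
`SchemaDiscovery`, and the `readFooter` parameter are hypothetical names, no 
Spark or Parquet API is called, the merge case uses a plain local `map` where 
the real thing would be a Spark job, and "random data file" is simplified to 
"first data file".

```scala
// Hypothetical model of the schema discovery plan; none of these names
// come from Spark or Parquet.
final case class Schema(fields: Map[String, String]) {
  // Naive merge: union of fields, the right-hand side wins on conflicts.
  def merge(other: Schema): Schema = Schema(fields ++ other.fields)
}

// A directory of Parquet files: an optional summary file plus data files.
final case class ParquetDir(summaryFile: Option[String], dataFiles: Seq[String])

object SchemaDiscovery {
  def discover(
      userSchema: Option[Schema],
      mergeSchemas: Boolean,
      dir: ParquetDir,
      readFooter: String => Schema): Option[Schema] =
    // A user-provided schema means no footer has to be read at all.
    userSchema.orElse {
      if (mergeSchemas) {
        // Point 3: read every data file footer and merge the schemas.
        // (Assumption: stands in for a distributed Spark job.)
        dir.dataFiles.map(readFooter).reduceOption(_ merge _)
      } else {
        // Point 2: read exactly one footer, preferring the summary file.
        // (Assumption: the first data file stands in for a random one.)
        dir.summaryFile.orElse(dir.dataFiles.headOption).map(readFooter)
      }
    }
}
```

    Either way, the driver reads at most one footer unless schema merging is 
explicitly requested.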



