[GitHub] spark pull request: [Doc] [SQL] Addes Hive metastore Parquet table...

liancheng Sat, 04 Apr 2015 09:32:37 -0700

Github user liancheng commented on a diff in the pull request:

    https://github.com/apache/spark/pull/5348#discussion_r27770121
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -1034,6 +1034,79 @@ df3.printSchema()
     
     </div>
     
    +### Hive metastore Parquet table conversion
    +
    +When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
    +Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
    +`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
    +
    +#### Hive/Parquet Schema Reconciliation
    +
    +There are two key differences between Hive and Parquet from the perspective of table schema
    +processing.
    +
    +1. Hive is case insensitive, while Parquet is not
    +1. Hive considers all columns nullable, while nullability in Parquet is significant
    +
    +For these reasons, we must reconcile the Hive metastore schema with the Parquet schema when converting a
    +Hive metastore Parquet table to a Spark SQL Parquet table.  The reconciliation rules are:
    +
    +1. Fields that have the same name in both schemas must have the same data type regardless of
    +   nullability.  The reconciled field should have the data type of the Parquet side, so that
    +   nullability is respected.
    +
    +1. The reconciled schema contains exactly those fields defined in the Hive metastore schema.
    +
    +   - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
    +   - Any fields that only appear in the Hive metastore schema are added as nullable fields in the
    +     reconciled schema.
    +
    +#### Metadata Refreshing
    +
    +Spark SQL caches Parquet metadata for better performance.  When Hive metastore Parquet table
    --- End diff --
    
    Agreed. The lack of such a section is part of the reason why I put the metadata refreshing section here...
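
For readers of the quoted diff, the reconciliation rules can be sketched in plain Python. This is only an illustration of the two rules as written in the doc text, not the code Spark SQL actually runs. The `Field` type and the `reconcile` function are hypothetical names, data types are represented as plain strings, and two details are assumptions that the doc text does not state: names are matched case-insensitively, and the reconciled field keeps the Hive metastore spelling of the name.

```python
from typing import List, NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for a schema field (not a Spark class)."""
    name: str
    data_type: str
    nullable: bool


def reconcile(hive_fields: List[Field], parquet_fields: List[Field]) -> List[Field]:
    """Reconcile a Hive metastore schema with a Parquet schema."""
    # Assumption: match names case-insensitively, since Hive is case insensitive.
    parquet_by_name = {f.name.lower(): f for f in parquet_fields}

    reconciled = []
    # Rule 2: the result contains exactly the fields of the Hive schema,
    # so Parquet-only fields are never visited and are dropped.
    for hive_field in hive_fields:
        parquet_field = parquet_by_name.get(hive_field.name.lower())
        if parquet_field is None:
            # Hive-only field: added as a nullable field.
            reconciled.append(Field(hive_field.name, hive_field.data_type, True))
        elif parquet_field.data_type != hive_field.data_type:
            # Rule 1: same-named fields must have the same data type.
            raise ValueError(
                f"type mismatch for {hive_field.name!r}: "
                f"{hive_field.data_type} vs {parquet_field.data_type}"
            )
        else:
            # Rule 1: take the Parquet side so that nullability is respected.
            # Assumption: keep the Hive metastore spelling of the name.
            reconciled.append(
                Field(hive_field.name, parquet_field.data_type, parquet_field.nullable)
            )
    return reconciled


hive = [Field("id", "int", True), Field("name", "string", True), Field("extra", "double", True)]
parquet = [Field("ID", "int", False), Field("name", "string", True), Field("unused", "long", False)]
result = reconcile(hive, parquet)
```

In this example `id` takes its non-nullable flag from the Parquet side, `unused` is dropped because it only appears in the Parquet schema, and `extra` is added as a nullable field because it only appears in the Hive metastore schema.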



