Github user liancheng commented on a diff in the pull request:
https://github.com/apache/spark/pull/5348#discussion_r27770121
--- Diff: docs/sql-programming-guide.md ---
@@ -1034,6 +1034,79 @@ df3.printSchema()
</div>
+### Hive metastore Parquet table conversion
+
+When reading from and writing to Hive metastore Parquet tables, Spark SQL will try to use its own
+Parquet support instead of Hive SerDe for better performance. This behavior is controlled by the
+`spark.sql.hive.convertMetastoreParquet` configuration, and is turned on by default.
+
+#### Hive/Parquet Schema Reconciliation
+
+There are two key differences between Hive and Parquet from the perspective of table schema
+processing.
+
+1. Hive is case insensitive, while Parquet is not
+1. Hive considers all columns nullable, while nullability in Parquet is significant
+
+For these reasons, we must reconcile the Hive metastore schema with the Parquet schema when
+converting a Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:
+
+1. Fields that have the same name in both schemas must have the same data type regardless of
+ nullability. The reconciled field should have the data type of the Parquet side, so that
+ nullability is respected.
+
+1. The reconciled schema contains exactly those fields defined in the Hive metastore schema.
+
+ - Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
+ - Any fields that only appear in the Hive metastore schema are added as nullable fields in the
+ reconciled schema.
+
+#### Metadata Refreshing
+
+Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
--- End diff --
Agreed. The lack of such a section is part of the reason why I put the metadata
refreshing section here...
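
For readers following the thread, the reconciliation rules quoted in the diff above can be sketched in a few lines of plain Python. This is only an illustration of the rules as worded in the diff, not Spark's actual implementation: the `Field` tuple and the `reconcile` function are hypothetical names, data types are plain strings, names are matched on their lowercase form, and the reconciled field keeps the Hive-side name. Those last two choices are assumptions that the diff text does not spell out.

```python
from typing import List, NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for a schema field: name, data type, nullability."""
    name: str
    dtype: str
    nullable: bool


def reconcile(hive_fields: List[Field], parquet_fields: List[Field]) -> List[Field]:
    """Apply the reconciliation rules from the diff to two flat schemas."""
    # Assumption: Hive is case insensitive, so names are matched in lowercase.
    parquet_by_name = {f.name.lower(): f for f in parquet_fields}

    reconciled = []
    # Iterating over the Hive fields means Parquet-only fields are dropped.
    for hive_field in hive_fields:
        parquet_field = parquet_by_name.get(hive_field.name.lower())
        if parquet_field is None:
            # Field only in the Hive metastore schema: add it as nullable.
            reconciled.append(Field(hive_field.name, hive_field.dtype, True))
            continue
        if parquet_field.dtype != hive_field.dtype:
            # Same name on both sides must mean the same data type.
            raise ValueError(
                f"type mismatch for {hive_field.name!r}: "
                f"{hive_field.dtype} (Hive) vs {parquet_field.dtype} (Parquet)"
            )
        # Take data type and nullability from the Parquet side.
        reconciled.append(
            Field(hive_field.name, parquet_field.dtype, parquet_field.nullable)
        )
    return reconciled


hive = [
    Field("id", "int", True),
    Field("name", "string", True),
    Field("extra", "string", True),
]
parquet = [
    Field("ID", "int", False),
    Field("name", "string", True),
    Field("unused", "double", True),
]
result = reconcile(hive, parquet)
```

Here `result` keeps the three Hive fields: `id` becomes non-nullable because the Parquet side says so, `extra` is added as nullable, and `unused` is dropped.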