GitHub user budde opened a pull request:
https://github.com/apache/spark/pull/16944
[SPARK-19611][SQL] Introduce configurable table schema inference
*Update: Accidentally broke #16942 via a force push. Opening a replacement
PR.*
Replaces #16797. See the discussion in this PR for more
details/justification for this change.
## Summary of changes
[JIRA for this change](https://issues.apache.org/jira/browse/SPARK-19611)
- Add a `spark.sql.hive.schemaInferenceMode` param to `SQLConf`
- Add a `schemaFromTableProps` field to `CatalogTable` (set to `true` when the
schema is successfully read from the table properties)
- Perform schema inference in `HiveMetastoreCatalog` when `schemaFromTableProps`
is `false`, depending on the value of `spark.sql.hive.schemaInferenceMode`
- Update table metadata properties in `HiveExternalCatalog.alterTable()`
- Add `HiveSchemaInferenceSuite` tests
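To make the intended control flow concrete, here is a minimal standalone sketch. It is not code from the patch: the Spark-internal types are stubbed out, and `SchemaInferenceSketch`, `Table`, `resolveSchema`, `infer` and `saveBack` are stand-ins introduced purely for illustration.

```scala
object SchemaInferenceSketch {
  sealed trait Mode
  case object InferAndSave extends Mode
  case object InferOnly extends Mode
  case object NeverInfer extends Mode

  // Stand-in for CatalogTable: schemaFromTableProps mirrors the new field.
  case class Table(name: String, schemaFromTableProps: Boolean, schema: Seq[String])

  def resolveSchema(
      table: Table,
      mode: Mode,
      infer: Table => Seq[String],
      saveBack: Table => Unit): Table = {
    if (table.schemaFromTableProps) {
      // A case-sensitive schema was recovered from the table properties: use it as-is.
      table
    } else mode match {
      case InferAndSave =>
        // Infer the schema from the data files and persist it back
        // (the patch does this through HiveExternalCatalog.alterTable()).
        val updated = table.copy(schema = infer(table))
        saveBack(updated)
        updated
      case InferOnly =>
        // Use the inferred schema for this session but don't write it back.
        table.copy(schema = infer(table))
      case NeverInfer =>
        // 2.1.0 behavior: fall back to the case-insensitive metastore schema.
        table
    }
  }
}
```

The real change wires the same three-way branch into `HiveMetastoreCatalog`, keyed off the new `schemaFromTableProps` flag.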
## How was this patch tested?
The tests added in `HiveSchemaInferenceSuite` verify that schema inference
works as expected.
## Open issues
- The option values for `spark.sql.hive.schemaInferenceMode` (e.g.
"INFER_AND_SAVE", "INFER_ONLY", "NEVER_INFER") should be made into constants or
an enum, but I couldn't find a sensible object to place them in that doesn't
introduce a dependency between sql/core and sql/hive (see the sketch after this
list).
- Should "INFER_AND_SAVE" be the default behavior? This restores the
out-of-the-box compatibility that was present prior to 2.1.0 but changes the
behavior of 2.1.0 (which is essentially "NEVER_INFER").
- Is `HiveExternalCatalog.alterTable()` the appropriate place to write
back the table metadata properties outside of `createTable()`? Should a new
external catalog method like `updateTableMetadata()` be introduced?
- All partition columns will still be treated as case-insensitive even
after inferring. As far as I remember, this has always been the case with
schema inference prior to Spark 2.1.0, and I haven't attempted to
reconcile it since it doesn't cause the same problems that case-sensitive
data fields do. Should we attempt to restore case sensitivity by inspecting
file paths, or leave this as-is?
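On the first open issue above, one shape this could take is a plain constants holder. This is a sketch only: the object's name and, in particular, its location are placeholders, since finding a home that doesn't couple sql/core to sql/hive is exactly what's unresolved.

```scala
// Placeholder name and location: where this object could live without
// introducing a dependency between sql/core and sql/hive is the open question.
object HiveSchemaInferenceMode {
  val INFER_AND_SAVE = "INFER_AND_SAVE"
  val INFER_ONLY = "INFER_ONLY"
  val NEVER_INFER = "NEVER_INFER"
  val values: Set[String] = Set(INFER_AND_SAVE, INFER_ONLY, NEVER_INFER)
}
```

A session could then select a mode with, e.g., `spark.conf.set("spark.sql.hive.schemaInferenceMode", HiveSchemaInferenceMode.INFER_ONLY)`.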
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/budde/spark SPARK-19611

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16944.patch

To close this pull request, make a commit to your master/trunk branch with
(at least) the following in the commit message:

    This closes #16944