GitHub user budde opened a pull request:

    https://github.com/apache/spark/pull/16944

    [SPARK-19611][SQL] Introduce configurable table schema inference

    *Update: Accidentally broke #16942 via a force push. Opening a replacement PR.*
    
    Replaces #16797. See the discussion in that PR for further details and justification for this change.
    
    ## Summary of changes
    [JIRA for this change](https://issues.apache.org/jira/browse/SPARK-19611)
    
    - Add a `spark.sql.hive.schemaInferenceMode` param to `SQLConf` (a hedged sketch of such an entry follows this list)
    - Add a `schemaFromTableProps` field to `CatalogTable` (set to true when the schema is successfully read from the table properties)
    - Perform schema inference in `HiveMetastoreCatalog` when `schemaFromTableProps` is false, depending on the value of `spark.sql.hive.schemaInferenceMode`
    - Update the table metadata properties in `HiveExternalCatalog.alterTable()`
    - Add tests in `HiveSchemaInferenceSuite`
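
    For concreteness, here is a minimal sketch of what the new `SQLConf` entry could look like. The builder calls shown (`buildConf`, `checkValues`, `createWithDefault`) follow the usual `SQLConf` pattern, but the exact names, doc text, and default used in this PR may differ:

    ```scala
    // Hedged sketch only; the actual entry in the PR may differ.
    val HIVE_SCHEMA_INFERENCE_MODE = buildConf("spark.sql.hive.schemaInferenceMode")
      .doc("Action to take when a case-sensitive schema can't be read from a Hive " +
        "table's properties: INFER_AND_SAVE infers it from the underlying data files " +
        "and writes it back to the table properties, INFER_ONLY infers without " +
        "persisting, and NEVER_INFER uses the case-insensitive metastore schema as-is.")
      .stringConf
      .checkValues(Set("INFER_AND_SAVE", "INFER_ONLY", "NEVER_INFER"))
      .createWithDefault("INFER_AND_SAVE")
    ```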
    
    ## How was this patch tested?
    
    The tests added in `HiveSchemaInferenceSuite` verify that schema inference works as expected under each of the three modes. The hedged snippet below sketches the kind of scenario these tests exercise.
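
    The following hypothetical `spark-shell` session illustrates the mismatch SPARK-19611 addresses: mixed-case Parquet field names versus the lower-cased metastore schema. The table name, path, and column names are illustrative, and the config shown is the one introduced by this PR (it is not available in released Spark 2.1.0):

    ```scala
    // Illustrative only. Assumes a SparkSession `spark` with Hive support.
    import spark.implicits._

    // Write Parquet data whose physical field name is mixed-case.
    Seq((1, "a"), (2, "b")).toDF("userId", "value")
      .write.parquet("/tmp/spark19611_demo")

    // Point an external Hive table at it; the metastore lower-cases the
    // column name to "userid".
    spark.sql(
      """CREATE EXTERNAL TABLE demo (userid INT, value STRING)
        |STORED AS PARQUET LOCATION '/tmp/spark19611_demo'""".stripMargin)

    // Under NEVER_INFER (effectively the 2.1.0 behavior), the lower-cased
    // metastore schema doesn't match the file's "userId" field, so the column
    // reads back as null. Under INFER_ONLY or INFER_AND_SAVE the case-sensitive
    // schema is inferred from the files and the data resolves correctly.
    spark.conf.set("spark.sql.hive.schemaInferenceMode", "INFER_AND_SAVE")
    spark.table("demo").show()
    ```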
    
    ## Open issues
    
    - The option values for `spark.sql.hive.schemaInferenceMode` (e.g. `INFER_AND_SAVE`, `INFER_ONLY`, `NEVER_INFER`) should be made into constants or an enum. I couldn't find a sensible object to place them in, though, that doesn't introduce a dependency between sql/core and sql/hive (one hedged possibility is sketched after this list).
    - Should `INFER_AND_SAVE` be the default behavior? This restores the out-of-the-box compatibility that was present prior to 2.1.0, but changes the behavior of 2.1.0 (which is essentially `NEVER_INFER`).
    - Is `HiveExternalCatalog.alterTable()` the appropriate place to write back the table metadata properties outside of `createTable()`? Should a new external catalog method like `updateTableMetadata()` be introduced?
    - All partition columns will still be treated as case-insensitive even after inference. As far as I remember, this has always been the case with schema inference prior to Spark 2.1.0, and I haven't attempted to reconcile it since it doesn't cause the same problems that case-sensitive data fields do. Should we attempt to restore case sensitivity by inspecting file paths, or leave this as-is?
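
    On the first open issue, one hedged possibility (the object name and its home are illustrative, not decisions from this PR) would be a small `Enumeration` placed somewhere both modules can reach, e.g. sql/catalyst:

    ```scala
    // Hypothetical sketch of the mode values as an Enumeration; where this
    // should live is exactly the open question above.
    object HiveSchemaInferenceMode extends Enumeration {
      val INFER_AND_SAVE, INFER_ONLY, NEVER_INFER = Value

      // Parse a config string, returning None for unrecognized values.
      def fromString(s: String): Option[Value] =
        values.find(_.toString == s.toUpperCase)
    }
    ```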

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/budde/spark SPARK-19611

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16944.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16944
    