Github user budde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16944#discussion_r104270319
  
    --- Diff: 
sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
    @@ -217,6 +227,62 @@ private[hive] class HiveMetastoreCatalog(sparkSession: 
SparkSession) extends Log
         result.copy(expectedOutputAttributes = Some(relation.output))
       }
     
    +  private def inferIfNeeded(
    +      relation: CatalogRelation,
    +      options: Map[String, String],
    +      fallbackSchema: StructType,
    +      fileFormat: FileFormat,
    +      fileIndexOpt: Option[FileIndex] = None): (StructType, CatalogTable) 
= {
    +    val inferenceMode = 
sparkSession.sessionState.conf.caseSensitiveInferenceMode
    +    val shouldInfer = (inferenceMode != NEVER_INFER) && 
!relation.tableMeta.schemaPreservesCase
    +    val tableName = relation.tableMeta.identifier.table
    +    if (shouldInfer) {
    +      logInfo(s"Inferring case-sensitive schema for table $tableName 
(inference mode: " +
    +        s"$inferenceMode)")
    +      val fileIndex = fileIndexOpt.getOrElse {
    +        val rootPath = new Path(new URI(relation.tableMeta.location))
    +        new InMemoryFileIndex(sparkSession, Seq(rootPath), options, None)
    +      }
    +
    +      val inferredSchema = {
    +        val schema = fileFormat.inferSchema(
    --- End diff --
    
    Let me lay out the scenario that I think ```mergeMetastoreParquetSchema``` 
is meant to guard against:
    
    * We have an external table named ```test_table``` containing the following 
fields:
      * field_one, a nullable String field
      * field_two, a nullable String field
      * partition_col, a String partition column
    
    Now lets say we add the following Parquet file partitions to the table:
    
    * ```part_col=both_values``` where each row in the file contains:
      * a value for ```field_one```
      * a value for ```field_two```
    * ```part_col=one_only``` where each row in the file contains:
      * a value for ```field_one```
      * no ```field_two``` column
    * ```part_col=two_only``` where each row in the file contains:
      * no ```field_one``` column
      * a value for ```field_two```
    
    For now, let's assume that ```inferSchema()``` was invoked after partition 
pruning and will only consider the Parquet files corresponding to the 
partitions being queried against. Let's consider what the resulting DataFrame 
query will be for the following SQL queries:
    
    * ```SELECT * FROM test_table WHERE part_col='both_values'```
    ```scala
    StructType(
      StructField("field_one", StringType, true),
      StructField("field_two", StringType, true)
      StructField("partition_col", StringType, true))
    ```
    * ```SELECT * FROM test_table WHERE part_col='one_only'```
    ```scala
    StructType(
      StructField("field_one", StringType, true),
      StructField("partition_col", StringType, true))
    ```
    * ```SELECT * FROM test_table WHERE part_col='two_only'```
    ```scala
    StructType(
      StructField("field_two", StringType, true),
      StructField("partition_col", StringType, true))
    ```
    
    The ```mergeMetastoreParquetSchema``` method exists to fill in any nullable 
fields obtained from the metastore that weren't present in the underlying 
Parquet files that were inspected. Again, #5214 along with its associated JIRA 
provide a bit more context here.
    
    This particular example is a bit contrived in this context as partition 
pruning is now a lazy operation that happens in the optimization stage and the 
inferred schema should scan all of the partitions that back the table (as far 
as I can tell), but there's still the edge case where all of a table's data 
files may not contain a field that is listed as nullable in the metastore 
schema. A user may write an application with the assumption that such a field 
is present in the DataFrame returned by a query against a table since the 
column is in the metastore which would fail at runtime if no merging occurs.
    
    I can remove this call if you still feel it's not suitable, I just want to 
make clear the reason why it is there.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to