Github user budde commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16944#discussion_r104270319

```diff
--- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala ---
@@ -217,6 +227,62 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
     result.copy(expectedOutputAttributes = Some(relation.output))
   }

+  private def inferIfNeeded(
+      relation: CatalogRelation,
+      options: Map[String, String],
+      fallbackSchema: StructType,
+      fileFormat: FileFormat,
+      fileIndexOpt: Option[FileIndex] = None): (StructType, CatalogTable) = {
+    val inferenceMode = sparkSession.sessionState.conf.caseSensitiveInferenceMode
+    val shouldInfer = (inferenceMode != NEVER_INFER) && !relation.tableMeta.schemaPreservesCase
+    val tableName = relation.tableMeta.identifier.table
+    if (shouldInfer) {
+      logInfo(s"Inferring case-sensitive schema for table $tableName (inference mode: " +
+        s"$inferenceMode)")
+      val fileIndex = fileIndexOpt.getOrElse {
+        val rootPath = new Path(new URI(relation.tableMeta.location))
+        new InMemoryFileIndex(sparkSession, Seq(rootPath), options, None)
+      }
+
+      val inferredSchema = {
+        val schema = fileFormat.inferSchema(
```

--- End diff --

Let me lay out the scenario that I think `mergeMetastoreParquetSchema` is meant to guard against.

We have an external table named `test_table` containing the following fields:

* `field_one`, a nullable String field
* `field_two`, a nullable String field
* `part_col`, a String partition column

Now let's say we add the following Parquet file partitions to the table:

* `part_col=both_values`, where each row in the file contains:
  * a value for `field_one`
  * a value for `field_two`
* `part_col=one_only`, where each row in the file contains:
  * a value for `field_one`
  * no `field_two` column
* `part_col=two_only`, where each row in the file contains:
  * no `field_one` column
  * a value for `field_two`

For now, let's assume that `inferSchema()` was invoked after partition pruning and will only consider the Parquet files corresponding to the partitions being queried against. Let's consider what the resulting DataFrame schema will be for each of the following SQL queries:

* `SELECT * FROM test_table WHERE part_col='both_values'`

  ```scala
  StructType(
    StructField("field_one", StringType, true),
    StructField("field_two", StringType, true),
    StructField("part_col", StringType, true))
  ```

* `SELECT * FROM test_table WHERE part_col='one_only'`

  ```scala
  StructType(
    StructField("field_one", StringType, true),
    StructField("part_col", StringType, true))
  ```

* `SELECT * FROM test_table WHERE part_col='two_only'`

  ```scala
  StructType(
    StructField("field_two", StringType, true),
    StructField("part_col", StringType, true))
  ```

The `mergeMetastoreParquetSchema` method exists to fill in any nullable fields from the metastore schema that weren't present in the underlying Parquet files that were inspected. Again, #5214 along with its associated JIRA provides a bit more context here.

This particular example is somewhat contrived in the current context, since partition pruning is now a lazy operation that happens in the optimization stage and schema inference should scan all of the partitions backing the table (as far as I can tell). There is still the edge case, however, where none of a table's data files contain a field that the metastore schema lists as nullable. A user may write an application assuming that such a field is present in the DataFrame returned by a query against the table, since the column exists in the metastore, and that application would fail at runtime if no merging occurs.

I can remove this call if you still feel it's not suitable; I just want to make clear the reason why it is there.
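To make the merge behavior concrete, here is a minimal plain-Scala sketch of the idea described above. It deliberately models `StructField` with a simple case class rather than using Spark's types, and `mergeWithMetastore` is an illustrative stand-in, not the actual `mergeMetastoreParquetSchema` implementation:

```scala
// Simplified stand-in for Spark's StructField (name + nullability only).
case class Field(name: String, nullable: Boolean)

// Keep every inferred (Parquet-derived, case-sensitive) field, then append
// any nullable metastore field that no inspected data file contained.
// mergeWithMetastore is a hypothetical name used for illustration.
def mergeWithMetastore(metastore: Seq[Field], inferred: Seq[Field]): Seq[Field] = {
  val inferredNames = inferred.map(_.name.toLowerCase).toSet
  inferred ++ metastore.filter(f => f.nullable && !inferredNames.contains(f.name.toLowerCase))
}

val metastoreSchema =
  Seq(Field("field_one", true), Field("field_two", true), Field("part_col", true))

// Schema inferred from the part_col=one_only files, which lack field_two:
val inferredOneOnly = Seq(Field("field_one", true), Field("part_col", true))

// field_two is restored from the metastore, so a query against this
// partition still exposes the column instead of failing at runtime:
println(mergeWithMetastore(metastoreSchema, inferredOneOnly).map(_.name))
// → List(field_one, part_col, field_two)
```

When the inferred schema already contains every metastore field, the merge is a no-op, which matches the `part_col=both_values` case above.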