[GitHub] [spark] budde commented on a change in pull request #16944: [SPARK-19611][SQL] Introduce configurable table schema inference

GitBox Mon, 22 Jun 2020 16:38:03 -0700


budde commented on a change in pull request #16944:
URL: https://github.com/apache/spark/pull/16944#discussion_r443880996




##########
File path: sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
##########
@@ -218,6 +228,54 @@ private[hive] class HiveMetastoreCatalog(sparkSession: SparkSession) extends Log
     result.copy(expectedOutputAttributes = Some(relation.output))
   }
 
+  private def inferIfNeeded(
+      relation: CatalogRelation,
+      options: Map[String, String],
+      fileFormat: FileFormat,
+      fileIndexOpt: Option[FileIndex] = None): (StructType, CatalogTable) = {
+    val inferenceMode = sparkSession.sessionState.conf.caseSensitiveInferenceMode
+    val shouldInfer = (inferenceMode != NEVER_INFER) && !relation.tableMeta.schemaPreservesCase
+    val tableName = relation.tableMeta.identifier.unquotedString
+    if (shouldInfer) {
+      logInfo(s"Inferring case-sensitive schema for table $tableName (inference mode: " +
+        s"$inferenceMode)")
+      val fileIndex = fileIndexOpt.getOrElse {
+        val rootPath = new Path(relation.tableMeta.location)
+        new InMemoryFileIndex(sparkSession, Seq(rootPath), options, None)
+      }
+
+      val inferredSchema = fileFormat
+        .inferSchema(
+          sparkSession,
+          options,
+          fileIndex.listFiles(Nil).flatMap(_.files))

Review comment:
   To expand on this, say you are dealing with a dataset that has an optional
field in its schema. If none of the data files in the partition(s) you filter
by contain this field, schema inference will never return it. This may not
matter in your particular use case, but IMO inspecting all of the files when
inferring the schema is the safer approach in the general case. I believe you
can configure a threshold for the number of data files above which the schema
inference step is distributed across the cluster and run as a map-reduce job
rather than on the driver node.
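
   To make the optional-field scenario concrete, here is a minimal sketch in plain Scala with no Spark dependency. Everything in it is hypothetical: `DataFile`, `InferenceSketch`, the partition names, and the column names are made up for illustration, and "inference" is modeled as nothing more than the union of the column names seen in the inspected files.

```scala
// Hypothetical model: each data file is reduced to its partition and the
// set of column names it physically contains.
final case class DataFile(partition: String, fields: Set[String])

object InferenceSketch {
  // Stand-in for schema inference: the union of all fields in the inspected files.
  def inferSchema(files: Seq[DataFile]): Set[String] =
    files.foldLeft(Set.empty[String])(_ ++ _.fields)

  val files: Seq[DataFile] = Seq(
    DataFile("dt=2020-06-20", Set("id", "eventName")),
    DataFile("dt=2020-06-21", Set("id", "eventName")),
    // The optional field only shows up in files of this partition.
    DataFile("dt=2020-06-22", Set("id", "eventName", "sessionId"))
  )

  // Inspecting every file sees the optional field.
  val fromAllFiles: Set[String] = inferSchema(files)

  // Inspecting only the files of the filtered partition misses it.
  val fromFilteredPartition: Set[String] =
    inferSchema(files.filter(_.partition == "dt=2020-06-20"))
}
```

   In this model `fromAllFiles` contains `sessionId` while `fromFilteredPartition` does not, which is the gap that inspecting all files is meant to close.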
   
   If the full schema is already known and you're not using any sort of
metastore to track table/schema state, setting `NEVER_INFER` and supplying the
schema manually as a read option might be a good approach.
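
   A rough sketch of what that could look like, assuming a Parquet dataset read by path. The column names and `/path/to/table` are placeholders, and the config key `spark.sql.hive.caseSensitiveInferenceMode` is the name used in released Spark versions, which may differ from the name in this revision of the PR.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object ManualSchemaRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("manual-schema-read")
      .master("local[*]")
      // Assumed key name; disables case-sensitive schema inference for Hive tables.
      .config("spark.sql.hive.caseSensitiveInferenceMode", "NEVER_INFER")
      .getOrCreate()

    // Hypothetical schema, declared up front so optional fields are always present.
    val schema = StructType(Seq(
      StructField("id", LongType),
      StructField("eventName", StringType),
      StructField("sessionId", StringType) // optional field, may be absent from some files
    ))

    // Supplying the schema explicitly means no data files are inspected to derive it.
    val df = spark.read.schema(schema).parquet("/path/to/table") // placeholder path
    df.printSchema()

    spark.stop()
  }
}
```

   With the schema fixed up front, a column that is missing from a given file should come back as null rather than being dropped from the result.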




