[GitHub] [hudi] alexeykudinkin commented on a diff in pull request #6213: [HUDI-4081] Addressing Spark SQL vs Spark DS performance gap

GitBox Tue, 26 Jul 2022 13:12:30 -0700


alexeykudinkin commented on code in PR #6213:
URL: https://github.com/apache/hudi/pull/6213#discussion_r930366177



##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala:
##########
@@ -241,32 +240,41 @@ object HoodieSparkSqlWriter {
             sparkContext.getConf.registerKryoClasses(
               Array(classOf[org.apache.avro.generic.GenericData],
                 classOf[org.apache.avro.Schema]))
+
+            // TODO(HUDI-4472) revisit and simplify schema handling
             var schema = AvroConversionUtils.convertStructTypeToAvroSchema(df.schema, structName, nameSpace)
-            val lastestSchema = getLatestTableSchema(fs, basePath, sparkContext, schema)
+            val latestSchema = getLatestTableSchema(fs, basePath, sparkContext, schema)
+
+            val enabledSchemaEvolution = parameters.getOrDefault(DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.key(), "false").toBoolean
             var internalSchemaOpt = getLatestTableInternalSchema(fs, basePath, sparkContext)
-            if (reconcileSchema && parameters.getOrDefault(DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.key(), "false").toBoolean
-              && internalSchemaOpt.isEmpty) {
-              // force apply full schema evolution.
-              internalSchemaOpt = Some(AvroInternalSchemaConverter.convert(schema))
-            }
+
             if (reconcileSchema) {
-              schema = lastestSchema
+              // In case we need to reconcile the schema and schema evolution is enabled,
+              // we will force-apply schema evolution to the writer's schema.
+              // Otherwise we simply fallback to the latest schema committed
+              if (enabledSchemaEvolution && internalSchemaOpt.isEmpty) {
+                internalSchemaOpt = Some(AvroInternalSchemaConverter.convert(schema))
+              } else {
+                schema = latestSchema
+              }
+            } else {
+              // In case reconciliation is disabled, we still have to do nullability attributes
+              // (minor) reconciliation, making sure schema of the incoming batch is in-line with
+              // the data already committed in the table
+              schema = AvroSchemaEvolutionUtils.canonicalizeColumnNullability(schema, latestSchema)

Review Comment:
   All of the surrounding cleanup was necessary to make this change: the logic 
had gotten too messy, with 3 sequential conditionals creating too many possible 
permutations.
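
   For illustration only, here is a minimal sketch of the consolidated 
branching as a single pure function. It is not the Hudi implementation: 
schemas are modeled as plain strings, and `Resolved`, `SchemaResolutionSketch`, 
`resolve`, and the `internal(...)` / `canonicalized(...)` markers are 
hypothetical stand-ins for the Avro and internal-schema conversions in the 
diff above.

```scala
// Hypothetical stand-ins: schemas are plain strings here, not Avro or
// InternalSchema objects, so the three outcomes are easy to compare.
final case class Resolved(schema: String, internalSchemaOpt: Option[String])

object SchemaResolutionSketch {
  def resolve(reconcileSchema: Boolean,
              enabledSchemaEvolution: Boolean,
              writerSchema: String,
              latestSchema: String,
              internalSchemaOpt: Option[String]): Resolved =
    if (reconcileSchema) {
      if (enabledSchemaEvolution && internalSchemaOpt.isEmpty) {
        // Force-apply schema evolution to the writer's schema;
        // the writer's schema itself is kept as-is.
        Resolved(writerSchema, Some(s"internal($writerSchema)"))
      } else {
        // Otherwise fall back to the latest committed schema.
        Resolved(latestSchema, internalSchemaOpt)
      }
    } else {
      // Reconciliation disabled: only nullability is aligned
      // (marker string stands in for the real canonicalization).
      Resolved(s"canonicalized($writerSchema, $latestSchema)", internalSchemaOpt)
    }
}
```

   Written this way, each combination of the two flags and the presence of an 
internal schema maps to exactly one outcome, which is the point of nesting 
the conditionals rather than running them in sequence.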



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

