nsivabalan commented on code in PR #5737:
URL: https://github.com/apache/hudi/pull/5737#discussion_r889523138
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieBaseRelation.scala:
##########
@@ -122,29 +122,39 @@ abstract class HoodieBaseRelation(val sqlContext: SQLContext,
     optParams.get(DataSourceReadOptions.TIME_TRAVEL_AS_OF_INSTANT.key)
       .map(HoodieSqlCommonUtils.formatQueryInstant)
 
+  /**
+   * NOTE: Initialization of the following members is coupled on purpose to minimize amount of I/O
+   *       required to fetch table's Avro and Internal schemas
+   */
   protected lazy val (tableAvroSchema: Schema, internalSchema: InternalSchema) = {
-    val schemaUtil = new TableSchemaResolver(metaClient)
-    val avroSchema = Try(schemaUtil.getTableAvroSchema) match {
-      case Success(schema) => schema
-      case Failure(e) =>
-        logWarning("Failed to fetch schema from the table", e)
-        // If there is no commit in the table, we can't get the schema
-        // t/h [[TableSchemaResolver]], fallback to the provided [[userSchema]] instead.
-        userSchema match {
-          case Some(s) => convertToAvroSchema(s)
-          case _ => throw new IllegalArgumentException("User-provided schema is required in case the table is empty")
-        }
+    val schemaResolver = new TableSchemaResolver(metaClient)
+    val avroSchema: Schema = schemaSpec.map(convertToAvroSchema).getOrElse {
+      Try(schemaResolver.getTableAvroSchema) match {
+        case Success(schema) => schema
+        case Failure(e) =>
+          logError("Failed to fetch schema from the table", e)
+          throw new HoodieSchemaException("Failed to fetch schema from the table")
+      }
     }
-    // try to find internalSchema
-    val internalSchemaFromMeta = try {
-      schemaUtil.getTableInternalSchemaFromCommitMetadata.orElse(InternalSchema.getEmptyInternalSchema)
-    } catch {
-      case _: Exception => InternalSchema.getEmptyInternalSchema
+
+    val schemaEvolutionEnabled: Boolean = optParams.getOrElse(DataSourceReadOptions.SCHEMA_EVOLUTION_ENABLED.key,
Review Comment:
   @xiarixiaoyao: A tangential question to make sure I understand the interplay here, i.e. how this config is used. The first time the read path runs with this config enabled, the internal schema will be generated.

   Option A: After that, do all read and write paths rely solely on the presence of the InternalSchema? That is, even on the read path, do we just check whether an internal schema is present?

   Option B: Or does the write path always rely on the schema-evolution-enabled config, while the read path relies on the presence of the internal schema?
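
   To make the two readings concrete, here is a minimal, hypothetical Scala sketch (the names `TableState`, `writeUsesEvolution*`, and `readUsesEvolution*` are illustrative only and are not Hudi's actual API); it just models which signal each path would consult under each option:

   ```scala
   object SchemaEvolutionSketch {
     // Hypothetical model of the two signals in play:
     // a persisted InternalSchema on the table, and the read/write config flag.
     final case class TableState(internalSchemaPresent: Boolean, configEnabled: Boolean)

     // Option A: once the internal schema exists, both read and write
     // paths key purely off its presence.
     def writeUsesEvolutionA(s: TableState): Boolean = s.internalSchemaPresent
     def readUsesEvolutionA(s: TableState): Boolean  = s.internalSchemaPresent

     // Option B: the write path always consults the schema-evolution config,
     // while the read path keys off the presence of the internal schema.
     def writeUsesEvolutionB(s: TableState): Boolean = s.configEnabled
     def readUsesEvolutionB(s: TableState): Boolean  = s.internalSchemaPresent
   }
   ```

   The difference shows up when the table already carries an internal schema but a writer runs with the config disabled: under Option A the write path would still apply evolution, under Option B it would not.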
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]