umehrot2 commented on a change in pull request #1702:
URL: https://github.com/apache/hudi/pull/1702#discussion_r447560016
##########
File path: hudi-spark/src/main/scala/org/apache/hudi/IncrementalRelation.scala
##########
@@ -71,13 +78,16 @@ class IncrementalRelation(val sqlContext: SQLContext,
optParams.getOrElse(DataSourceReadOptions.END_INSTANTTIME_OPT_KEY,
lastInstant.getTimestamp))
.getInstants.iterator().toList
- // use schema from latest metadata, if not present, read schema from the data file
- private val latestSchema = {
- val schemaUtil = new TableSchemaResolver(metaClient)
- val tableSchema = HoodieAvroUtils.createHoodieWriteSchema(schemaUtil.getTableAvroSchemaWithoutMetadataFields);
- AvroConversionUtils.convertAvroSchemaToStructType(tableSchema)
+ // use schema from a file produced in the latest instant
+ val latestSchema: StructType = {
+ log.info("Inferring schema..")
+ val schemaResolver = new TableSchemaResolver(metaClient)
+ val tableSchema = schemaResolver.getTableAvroSchemaWithoutMetadataFields
+ val dataSchema = AvroConversionUtils.convertAvroSchemaToStructType(tableSchema)
+ StructType(skeletonSchema.fields ++ dataSchema.fields)
Review comment:
Not really. The intent here is to read only the **user data schema** and then
append the **metadata/skeleton schema** to it. This lets us avoid unnecessary
checks here, because if we read the whole schema from a file, the result would
differ by file type. For regular Hudi files, the schema contains both the
**skeleton + user data schema**, whereas for **bootstrapped files** it contains
only the **user data schema** read from the source file. So to keep things
simple I did it this way.
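
To make the composition concrete, here is a minimal sketch of the idea, not the code from this PR. It assumes Spark SQL is on the classpath. The object name, the `composeTableSchema` helper, and the two user columns (`id`, `name`) are hypothetical, and the skeleton schema is assumed to consist of the five standard Hudi metadata columns as nullable strings.

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object SchemaCompositionSketch {
  // Assumption: the skeleton schema holds the Hudi metadata columns as nullable strings.
  val skeletonSchema: StructType = StructType(Seq(
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name"
  ).map(name => StructField(name, StringType, nullable = true)))

  // Hypothetical user data schema, standing in for the schema resolved
  // without metadata fields.
  val dataSchema: StructType = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("name", StringType, nullable = true)
  ))

  // Always start from the user data schema and prepend the skeleton fields,
  // so regular and bootstrapped files end up with the same table schema.
  def composeTableSchema(skeleton: StructType, data: StructType): StructType =
    StructType(skeleton.fields ++ data.fields)

  def main(args: Array[String]): Unit = {
    val tableSchema = composeTableSchema(skeletonSchema, dataSchema)
    println(tableSchema.fieldNames.mkString(", "))
  }
}
```

Because the input is always the user data schema alone, no branch is needed to detect whether the metadata columns are already present.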