empcl commented on PR #11069:
URL: https://github.com/apache/hudi/pull/11069#issuecomment-2071345347
Let me describe the scenario. First, run `set
hoodie.datasource.write.row.writer.enable = false;`
Then execute the following statement:
```
merge into source1.hudi_cow-append t0
using (
  select *, 1 as sex from source1.tmp_cow_table
) s0
on t0.id = s0.id
when matched then update set id = s0.id, name = s0.name, sex = s0.sex
when not matched then insert *;
```
During the MERGE INTO write (`HoodieSparkSqlWriterInternal#writeInternal()`),
the MERGE INTO schema is added to the write config under `hoodie.avro.schema`.
When clustering runs
(`MultipleSparkJobExecutionStrategy#readRecordsForGroupBaseFiles`), it reads
`hoodie.avro.schema` directly from the write config and sets it as the
`parquet.avro.projection` configuration. Later in clustering, the
`parquet.avro.projection` value is read back and used to build the
requestedSchema, while the fileSchema is read from the parquet file itself.
Comparing requestedSchema with fileSchema shows that the repetition of the sex
field is inconsistent.
fileSchema:
```
message hoodie.tmp_cow_table.tmp_cow_table_record {
optional binary _hoodie_commit_time (UTF8);
optional binary _hoodie_commit_seqno (UTF8);
optional binary _hoodie_record_key (UTF8);
optional binary _hoodie_partition_path (UTF8);
optional binary _hoodie_file_name (UTF8);
optional int32 id;
optional binary name (UTF8);
optional binary age (UTF8);
}
```
requestedSchema:
```
message hoodie.tmp_cow_table.tmp_cow_table_record {
optional binary _hoodie_commit_time (UTF8);
optional binary _hoodie_commit_seqno (UTF8);
optional binary _hoodie_record_key (UTF8);
optional binary _hoodie_partition_path (UTF8);
optional binary _hoodie_file_name (UTF8);
optional int32 id;
optional binary name (UTF8);
required binary age (UTF8);
}
```
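To make the comparison concrete, here is a minimal Python sketch that diffs the field repetition of two flat Parquet message schemas given as text. It is only an illustration of the check described above, not the code Hudi or parquet-avro actually runs; the helper names `field_repetitions` and `repetition_mismatches` are hypothetical, and the two schema strings are shortened copies of the ones pasted in this comment.

```python
import re

# Matches one primitive field line of a Parquet message schema, e.g.
# "optional binary name (UTF8);" -> ("optional", "binary", "name").
# The "message ... {" header and the closing "}" do not match.
FIELD_RE = re.compile(r"^\s*(required|optional|repeated)\s+(\w+)\s+(\w+)")


def field_repetitions(schema_text):
    """Map field name -> repetition for a flat Parquet message schema string."""
    result = {}
    for line in schema_text.splitlines():
        match = FIELD_RE.match(line)
        if match:
            repetition, _physical_type, name = match.groups()
            result[name] = repetition
    return result


def repetition_mismatches(file_schema, requested_schema):
    """Return {field: (file repetition, requested repetition)} where they differ."""
    file_fields = field_repetitions(file_schema)
    requested_fields = field_repetitions(requested_schema)
    return {
        name: (file_fields[name], requested_fields[name])
        for name in requested_fields
        if name in file_fields and file_fields[name] != requested_fields[name]
    }


# Shortened copies of the schemas shown in this comment (metadata fields omitted).
FILE_SCHEMA = """
message hoodie.tmp_cow_table.tmp_cow_table_record {
optional int32 id;
optional binary name (UTF8);
optional binary age (UTF8);
}
"""

REQUESTED_SCHEMA = """
message hoodie.tmp_cow_table.tmp_cow_table_record {
optional int32 id;
optional binary name (UTF8);
required binary age (UTF8);
}
"""

if __name__ == "__main__":
    # prints {'age': ('optional', 'required')}
    print(repetition_mismatches(FILE_SCHEMA, REQUESTED_SCHEMA))
```

For the two schemas pasted above, the only field this reports is the one declared `optional` in the file and `required` in the requested projection.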