zhangyue19921010 commented on code in PR #13365:
URL: https://github.com/apache/hudi/pull/13365#discussion_r2138025106
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -150,14 +149,8 @@ object HoodieDatasetBulkInsertHelper
table: HoodieTable[_, _, _, _],
writeConfig: HoodieWriteConfig,
arePartitionRecordsSorted: Boolean,
- shouldPreserveHoodieMetadata: Boolean,
- operation: WriteOperationType): HoodieData[WriteStatus] = {
- val schema = operation match {
- case WriteOperationType.CLUSTER =>
-        alignNotNullFields(dataset.schema, new Schema.Parser().parse(writeConfig.getSchema))
- case _ =>
- dataset.schema
- }
+ shouldPreserveHoodieMetadata: Boolean): HoodieData[WriteStatus] = {
+     val schema = AvroConversionUtils.alignFields(dataset.schema, new Schema.Parser().parse(writeConfig.getSchema))
Review Comment:
This schema correction is necessary. The existing "bulk insert as row" operation uses the DataFrame's schema, which may be inconsistent with the table schema. If the nullable property of an identical column differs between the two, the written files become corrupted/unreadable after binary copying. We therefore align the nullable properties against the table schema for the bulk insert as row operation to resolve this issue.
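To illustrate the idea behind the alignment (this is a simplified sketch, not Hudi's actual `AvroConversionUtils.alignFields` implementation; `Field` and `alignNullability` are hypothetical names), the DataFrame-derived fields take the nullability declared by the table schema for every column present in both:

```scala
// Hypothetical, simplified field representation standing in for Spark's
// StructField / Avro schema fields.
case class Field(name: String, dataType: String, nullable: Boolean)

// Sketch of the alignment: for every field present in both schemas, keep the
// DataFrame field but override its nullable flag with the table schema's,
// so the written files agree with the table definition.
def alignNullability(dfSchema: Seq[Field], tableSchema: Seq[Field]): Seq[Field] = {
  val tableByName = tableSchema.map(f => f.name -> f).toMap
  dfSchema.map { f =>
    tableByName.get(f.name) match {
      case Some(t) => f.copy(nullable = t.nullable)
      case None    => f
    }
  }
}

// Example: the DataFrame infers "price" as non-nullable while the table
// declares it nullable; writing with the unaligned schema would embed
// metadata that disagrees with the table.
val dfSchema    = Seq(Field("id", "string", nullable = false), Field("price", "double", nullable = false))
val tableSchema = Seq(Field("id", "string", nullable = false), Field("price", "double", nullable = true))

val aligned = alignNullability(dfSchema, tableSchema)
```

After alignment, `price` carries the table's nullable flag while `id` is unchanged, which is the property the fix relies on.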
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]