zhangyue19921010 commented on code in PR #13365:
URL: https://github.com/apache/hudi/pull/13365#discussion_r2138025106
##########
hudi-client/hudi-spark-client/src/main/scala/org/apache/hudi/HoodieDatasetBulkInsertHelper.scala:
##########
@@ -150,14 +149,8 @@ object HoodieDatasetBulkInsertHelper
table: HoodieTable[_, _, _, _],
writeConfig: HoodieWriteConfig,
arePartitionRecordsSorted: Boolean,
- shouldPreserveHoodieMetadata: Boolean,
- operation: WriteOperationType): HoodieData[WriteStatus] = {
- val schema = operation match {
- case WriteOperationType.CLUSTER =>
-        alignNotNullFields(dataset.schema, new Schema.Parser().parse(writeConfig.getSchema))
- case _ =>
- dataset.schema
- }
+ shouldPreserveHoodieMetadata: Boolean): HoodieData[WriteStatus] = {
+     val schema = AvroConversionUtils.alignFields(dataset.schema, new Schema.Parser().parse(writeConfig.getSchema))
Review Comment:
This schema correction is necessary. The existing "bulk insert as row" operation uses the DataFrame's schema, which may be inconsistent with the table schema. If the nullable property of an identical column differs between the two, the written files become corrupted/unreadable after binary copying. We therefore align the nullable properties against the table schema for the bulk insert as row operation to resolve this issue.
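To illustrate the idea behind the alignment (this is a simplified sketch, not Hudi's actual `AvroConversionUtils.alignFields` implementation; `Field` and `alignNullability` are hypothetical names), the DataFrame-derived fields take the nullability declared by the table schema for every column present in both:

```scala
// Hypothetical, simplified field representation standing in for Spark's
// StructField / Avro schema fields.
case class Field(name: String, dataType: String, nullable: Boolean)

// Sketch of the alignment: for every field present in both schemas, keep the
// DataFrame field but override its nullable flag with the table schema's,
// so the written files agree with the table definition.
def alignNullability(dfSchema: Seq[Field], tableSchema: Seq[Field]): Seq[Field] = {
  val tableByName = tableSchema.map(f => f.name -> f).toMap
  dfSchema.map { f =>
    tableByName.get(f.name) match {
      case Some(t) => f.copy(nullable = t.nullable)
      case None    => f
    }
  }
}

// Example: the DataFrame infers "price" as non-nullable while the table
// declares it nullable; writing with the unaligned schema would embed
// metadata that disagrees with the table.
val dfSchema    = Seq(Field("id", "string", nullable = false), Field("price", "double", nullable = false))
val tableSchema = Seq(Field("id", "string", nullable = false), Field("price", "double", nullable = true))

val aligned = alignNullability(dfSchema, tableSchema)
```

After alignment, `price` carries the table's nullable flag while `id` is unchanged, which is the property the fix relies on.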
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]