jonvex commented on code in PR #10858:
URL: https://github.com/apache/hudi/pull/10858#discussion_r1524990878
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java:
##########
@@ -574,12 +574,12 @@ private InputBatch fetchNextBatchFromSource(Option<String> resumeCheckpointStr,
       checkpointStr = dataAndCheckpoint.getCheckpointForNextBatch();
       if (this.userProvidedSchemaProvider != null && this.userProvidedSchemaProvider.getTargetSchema() != null
           && this.userProvidedSchemaProvider.getTargetSchema() != InputBatch.NULL_SCHEMA) {
+        // Let's deduce the schema provider for writer side first!
+        schemaProvider = getDeducedSchemaProvider(this.userProvidedSchemaProvider.getTargetSchema(), this.userProvidedSchemaProvider, metaClient);
Review Comment:
If you need more convincing: at line 616 you can see we were already using the
deduced schema for the row writer in that case, so we were already being
inconsistent about when deduction is applied. Additionally, in the Spark SQL
writer, all schemas go through deduction, even when the bulk insert row writer
is used there. I'm not promising that the bulk insert row writer will have
schema evolution working perfectly, but at least we will be consistent now.
Deduction also replaces null schemas, so they don't cause an NPE. That's why my
change at lines 610-613 is needed. I will fight less hard for the changes at
lines 578-580, but I think this will come back to haunt us later if we don't
make that change now.
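To illustrate the null-schema point: a minimal, hypothetical sketch of the idea
behind routing everything through deduction. This is not Hudi's actual
`getDeducedSchemaProvider` implementation; `DeducedSchemaSketch`,
`deduceWriterSchema`, and the string stand-ins for Avro schemas are all
invented for illustration. The point is that the deduced path normalizes a
null (or sentinel "null") source schema into a usable fallback, so downstream
writer code never dereferences null.

```java
// Hypothetical sketch: strings stand in for Avro schemas so the example is
// self-contained. The real code works with SchemaProvider / Avro Schema.
public class DeducedSchemaSketch {

  // Stand-in for InputBatch.NULL_SCHEMA: a sentinel meaning "no schema known".
  static final String NULL_SCHEMA = "\"null\"";

  // Deduce the schema the writer should use: prefer the source-provided
  // schema, but fall back to the latest table schema when the source hands
  // back null or the NULL_SCHEMA sentinel, so later code cannot hit an NPE.
  static String deduceWriterSchema(String sourceSchema, String latestTableSchema) {
    if (sourceSchema == null || NULL_SCHEMA.equals(sourceSchema)) {
      return latestTableSchema;
    }
    return sourceSchema;
  }

  public static void main(String[] args) {
    String tableSchema = "{\"type\":\"record\",\"name\":\"t\"}";

    // Null source schema falls back to the table schema instead of NPE'ing.
    System.out.println(deduceWriterSchema(null, tableSchema).equals(tableSchema)); // true

    // The NULL_SCHEMA sentinel is treated the same way.
    System.out.println(deduceWriterSchema(NULL_SCHEMA, tableSchema).equals(tableSchema)); // true

    // A real source schema wins over the table schema.
    String srcSchema = "{\"type\":\"record\",\"name\":\"s\"}";
    System.out.println(deduceWriterSchema(srcSchema, tableSchema).equals(srcSchema)); // true
  }
}
```

If both writer paths go through a check like this, a null target schema is
handled in one place instead of surfacing as an NPE deep inside the row writer.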
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]