jonvex commented on code in PR #10858:
URL: https://github.com/apache/hudi/pull/10858#discussion_r1524990878
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java:
##########
@@ -574,12 +574,12 @@ private InputBatch fetchNextBatchFromSource(Option<String> resumeCheckpointStr,
       checkpointStr = dataAndCheckpoint.getCheckpointForNextBatch();
       if (this.userProvidedSchemaProvider != null && this.userProvidedSchemaProvider.getTargetSchema() != null
           && this.userProvidedSchemaProvider.getTargetSchema() != InputBatch.NULL_SCHEMA) {
+        // Let's deduce the schema provider for writer side first!
+        schemaProvider = getDeducedSchemaProvider(this.userProvidedSchemaProvider.getTargetSchema(), this.userProvidedSchemaProvider, metaClient);
Review Comment:
If you need more convincing: at line 616 you can see we were already using the
deduced schema for the row writer in that case, so we were already being
inconsistent about when deduction is applied. Additionally, in the Spark SQL
writer, all schemas go through deduction, even when the bulk insert row writer
is used there. I'm not promising that the bulk insert row writer will have
schema evolution working perfectly, but at least we will be consistent now.
Deduction also replaces null schemas, so they don't cause an NPE. That's why my
change at lines 610-613 is needed. I will fight less hard for the changes at
lines 578-580, but I think this will come back to haunt us later if we don't
make that change now.
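To illustrate the null-schema point: a minimal, hypothetical sketch of the idea
behind routing everything through deduction. This is not Hudi's actual
`getDeducedSchemaProvider` implementation; `DeducedSchemaSketch`,
`deduceWriterSchema`, and the string stand-ins for Avro schemas are all
invented for illustration. The point is that the deduced path normalizes a
null (or sentinel "null") source schema into a usable fallback, so downstream
writer code never dereferences null.

```java
// Hypothetical sketch: strings stand in for Avro schemas so the example is
// self-contained. The real code works with SchemaProvider / Avro Schema.
public class DeducedSchemaSketch {

  // Stand-in for InputBatch.NULL_SCHEMA: a sentinel meaning "no schema known".
  static final String NULL_SCHEMA = "\"null\"";

  // Deduce the schema the writer should use: prefer the source-provided
  // schema, but fall back to the latest table schema when the source hands
  // back null or the NULL_SCHEMA sentinel, so later code cannot hit an NPE.
  static String deduceWriterSchema(String sourceSchema, String latestTableSchema) {
    if (sourceSchema == null || NULL_SCHEMA.equals(sourceSchema)) {
      return latestTableSchema;
    }
    return sourceSchema;
  }

  public static void main(String[] args) {
    String tableSchema = "{\"type\":\"record\",\"name\":\"t\"}";

    // Null source schema falls back to the table schema instead of NPE'ing.
    System.out.println(deduceWriterSchema(null, tableSchema).equals(tableSchema)); // true

    // The NULL_SCHEMA sentinel is treated the same way.
    System.out.println(deduceWriterSchema(NULL_SCHEMA, tableSchema).equals(tableSchema)); // true

    // A real source schema wins over the table schema.
    String srcSchema = "{\"type\":\"record\",\"name\":\"s\"}";
    System.out.println(deduceWriterSchema(srcSchema, tableSchema).equals(srcSchema)); // true
  }
}
```

If both writer paths go through a check like this, a null target schema is
handled in one place instead of surfacing as an NPE deep inside the row writer.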
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]