hudi-agent commented on code in PR #18896:
URL: https://github.com/apache/hudi/pull/18896#discussion_r3343975743


##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/HiveIncrPullSource.java:
##########
@@ -133,7 +133,7 @@ protected InputBatch<JavaRDD<GenericRecord>> 
readFromCheckpoint(Option<Checkpoin
           AvroKey.class, NullWritable.class, 
sparkContext.hadoopConfiguration());
       sparkContext.setJobGroup(this.getClass().getSimpleName(), "Fetch new 
data");
       return new InputBatch<>(Option.of(avroRDD.keys().map(r -> 
((GenericRecord) r.datum()))),
-          String.valueOf(commitToPull.get()));
+          createCheckpoint(writeTableVersion, 
String.valueOf(commitToPull.get())));

Review Comment:
   🤖 I think `String.valueOf(commitToPull.get())` here ends up calling 
`Checkpoint.toString()`, which returns something like 
`"StreamerCheckpointV2{checkpointKey='20200101000000'}"` (not the bare instant 
time). That string then becomes the new checkpoint key, so subsequent runs 
would compare instant times against 
`"StreamerCheckpointV{1,2}{checkpointKey='…'}"` and the loop in 
`findCommitToPull` would never advance. This is a pre-existing bug, but since 
you're touching this exact line — would just `commitToPull.get()` (or 
`createCheckpoint(writeTableVersion, commitToPull.get().getCheckpointKey())`) 
work, given `findCommitToPull` already returns a correctly-versioned 
`Checkpoint`? The new `testHiveIncrPullSource` only exercises 
`findCommitToPull` directly, so it doesn't catch this.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



##########
hudi-common/src/main/java/org/apache/hudi/common/table/checkpoint/CheckpointUtils.java:
##########
@@ -71,6 +71,11 @@ public static Checkpoint getCheckpoint(HoodieCommitMetadata 
commitMetadata) {
     throw new HoodieException("Checkpoint is not found in the commit metadata: 
" + commitMetadata.getExtraMetadata());
   }
 
+  public static Checkpoint createCheckpoint(int writeTableVersion, String 
checkpointToResume) {

Review Comment:
   🤖 nit: could you rename this to `buildCheckpoint` (or rename 
`buildCheckpointFromGeneralSource` to `createCheckpointFromGeneralSource`) so 
the two factory methods in the same class use a consistent verb? As it stands, 
a caller trying to pick between them has to read both signatures to understand 
the difference.
   
   <sub><i>- AI-generated; verify before applying. React 👍/👎 to flag 
quality.</i></sub>



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to