lokesh-lingarajan-0310 commented on code in PR #9473:
URL: https://github.com/apache/hudi/pull/9473#discussion_r1302262807
##########
hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/IncrSourceHelper.java:
##########
@@ -130,7 +130,7 @@ public static QueryInfo generateQueryInfo(JavaSparkContext
jssc, String srcBaseP
}
});
- String previousInstantTime = beginInstantTime;
+ String previousInstantTime = DEFAULT_BEGIN_TIMESTAMP;
if (!beginInstantTime.equals(DEFAULT_BEGIN_TIMESTAMP)) {
Review Comment:
The issue here is partial read of initial commit of new flows. Lets say if
the initial commit is C1 containing k1, k2, k3 objects and if after the first
commit if the sourcelimit allows it to read only k1, we will have the first
checkpoint as C1#k1, following this on the second round of sync, when we come
this api to calculate previous, begin and end instances, if we initialize
previous to begin, we will have previous = C1, begin = C1, end = C5 (lets say
we have had 4 commits after the first one). The code following -
https://github.com/apache/hudi/blob/21e462cca3551eaf84e10442cf7abd25003b40a8/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/QueryRunner.java#L81
Will end up skipping C1 altogether, hence reading C1 partially. The fix is
to default previous to start that way this case is handled and full read of C1
will happen.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]