satishkotha commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r405144082
 
 

 ##########
 File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
 ##########
 @@ -118,6 +119,34 @@
     return returns.toArray(new FileStatus[returns.size()]);
   }
 
+  /**
+   * Filter any specific instants that we do not want to process.
+   * example timeline:
+   *
+   * t0 -> create bucket1.parquet
+   * t1 -> create and append updates bucket1.log
+   * t2 -> request compaction
+   * t3 -> create bucket2.parquet
+   *
+   * if compaction at t2 takes a long time, incremental readers on RO tables can move to t3 and would skip updates in t1
+   *
+   * To workaround this problem, we want to stop returning data belonging to commits > t2.
+   * After compaction is complete, incremental reader would see updates in t2, t3, so on.
+   */
+  protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline timeline) {
+    Option<HoodieInstant> pendingCompactionInstant = timeline.filterPendingCompactionTimeline().firstInstant();
+    if (pendingCompactionInstant.isPresent()) {
 
 Review comment:
   I can introduce a jobConf variable. I also agree with you that RT is the 
right approach. I'm just suggesting that we remove the incremental read examples 
from the various documents. For example, the docker demo shows [incremental reads on RO 
views](https://hudi.apache.org/docs/docker_demo.html#step-9-run-hive-queries-including-incremental-queries).
 So people are likely to use this and end up with this difficult-to-debug 
problem.
   
   Also, you are right about the last statement. I already have tests showing that 
the compaction timestamp, not the update timestamp, is used for all updated 
records. Please see line 256 (the last line in 
TestMergeOnReadTable#testIncrementalReadsWithCompaction), which does not include 
updateTime. We validate that records include compactionCommitTime instead.
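   
   To make the behavior in the Javadoc above concrete, here is a minimal, self-contained sketch of the filtering idea. It does not use Hudi classes: `SketchInstant`, `firstPendingCompaction`, and `filterInstants` are hypothetical stand-ins, and it assumes the timeline is ordered by timestamp and that timestamps compare lexicographically. The actual change in this PR may be implemented differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class TimelineFilterSketch {

  /** Hypothetical stand-in for a timeline instant; this is NOT Hudi's HoodieInstant. */
  static final class SketchInstant {
    final String timestamp;
    final String action;
    final boolean pending;

    SketchInstant(String timestamp, String action, boolean pending) {
      this.timestamp = timestamp;
      this.action = action;
      this.pending = pending;
    }

    @Override
    public String toString() {
      return timestamp;
    }
  }

  /** Returns the earliest pending compaction, assuming the list is ordered by timestamp. */
  static Optional<SketchInstant> firstPendingCompaction(List<SketchInstant> timeline) {
    return timeline.stream()
        .filter(i -> i.pending && "compaction".equals(i.action))
        .findFirst();
  }

  /**
   * Keeps only instants strictly earlier than the first pending compaction.
   * If no compaction is pending, the timeline is returned unchanged (as a copy).
   */
  static List<SketchInstant> filterInstants(List<SketchInstant> timeline) {
    Optional<SketchInstant> pendingCompaction = firstPendingCompaction(timeline);
    if (!pendingCompaction.isPresent()) {
      return new ArrayList<>(timeline);
    }
    String cutoff = pendingCompaction.get().timestamp;
    List<SketchInstant> kept = new ArrayList<>();
    for (SketchInstant instant : timeline) {
      if (instant.timestamp.compareTo(cutoff) < 0) {
        kept.add(instant);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Mirrors the example timeline from the Javadoc: t2 is a pending compaction.
    List<SketchInstant> timeline = Arrays.asList(
        new SketchInstant("t0", "commit", false),
        new SketchInstant("t1", "deltacommit", false),
        new SketchInstant("t2", "compaction", true),
        new SketchInstant("t3", "commit", false));
    System.out.println(filterInstants(timeline));
  }
}
```

   With the example timeline, only t0 and t1 are kept while the compaction at t2 is pending, so an incremental reader cannot advance to t3 and skip the updates from t1.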

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
