satishkotha commented on a change in pull request #1396: [HUDI-687] Stop 
incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r403836768
 
 

 ##########
 File path: 
hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
 ##########
 @@ -118,6 +119,34 @@
     return returns.toArray(new FileStatus[returns.size()]);
   }
 
+  /**
+   * Filter any specific instants that we do not want to process.
+   * example timeline:
+   *
+   * t0 -> create bucket1.parquet
+   * t1 -> create and append updates bucket1.log
+   * t2 -> request compaction
+   * t3 -> create bucket2.parquet
+   *
+   * if compaction at t2 takes a long time, incremental readers on RO tables can move to t3 and would skip updates in t1
+   *
+   * To workaround this problem, we want to stop returning data belonging to commits > t2.
+   * After compaction is complete, incremental reader would see updates in t2, t3, so on.
+   */
+  protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline timeline) {
+    Option<HoodieInstant> pendingCompactionInstant = timeline.filterPendingCompactionTimeline().firstInstant();
+    if (pendingCompactionInstant.isPresent()) {
 
 Review comment:
   Yes, this is the crux of the change. My understanding is that this bug
caused data loss on derived ETL tables multiple times. These ETL tables are
generated using incremental reads on "RO" views. As you suggested, that is the
core issue, and switching to RT views is likely to get rid of the problem.
   
   Also, given that "getting started" and other demo examples include
incremental reads on RO views, I think this new safeguard is useful to have,
especially given that finding the root cause took a while. I am fine with
abandoning this change if we can remove the incremental read examples on RO
views from the documentation.
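
   For illustration, the cut-off described in the Javadoc above can be
sketched without any Hudi classes. This is a minimal model, not the code in
this PR: `TimelineSketch`, `Instant`, `Action` and
`instantsBeforePendingCompaction` are hypothetical names, and the timeline is
assumed to be a list already sorted by instant time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a timeline; none of these types are Hudi classes.
class TimelineSketch {

  enum Action { COMMIT, DELTA_COMMIT, COMPACTION }

  static final class Instant {
    final String timestamp;  // sortable instant time, e.g. "t2"
    final Action action;
    final boolean completed; // false means the instant is still pending

    Instant(String timestamp, Action action, boolean completed) {
      this.timestamp = timestamp;
      this.action = action;
      this.completed = completed;
    }
  }

  // Returns the instants that precede the first pending compaction.
  // If no compaction is pending, the whole timeline is returned.
  // Assumes the input list is sorted by timestamp.
  static List<Instant> instantsBeforePendingCompaction(List<Instant> timeline) {
    List<Instant> visible = new ArrayList<>();
    for (Instant instant : timeline) {
      if (instant.action == Action.COMPACTION && !instant.completed) {
        break; // stop here: later commits stay hidden until compaction finishes
      }
      visible.add(instant);
    }
    return visible;
  }

  public static void main(String[] args) {
    // The example timeline from the Javadoc: compaction requested at t2 is pending.
    List<Instant> timeline = Arrays.asList(
        new Instant("t0", Action.COMMIT, true),
        new Instant("t1", Action.DELTA_COMMIT, true),
        new Instant("t2", Action.COMPACTION, false),
        new Instant("t3", Action.COMMIT, true));

    for (Instant instant : instantsBeforePendingCompaction(timeline)) {
      System.out.println(instant.timestamp); // prints t0, then t1
    }
  }
}
```

   With t2 pending, the reader is not handed t3, so its checkpoint cannot move
past the updates written at t1. Once t2 is marked completed, the same call
returns all four instants.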

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
