vinothchandar commented on a change in pull request #1396: [HUDI-687] Stop incremental reader on RO table before a pending compaction
URL: https://github.com/apache/incubator-hudi/pull/1396#discussion_r404162013
##########
File path: hudi-hadoop-mr/src/main/java/org/apache/hudi/hadoop/HoodieParquetInputFormat.java
##########
@@ -118,6 +119,34 @@
return returns.toArray(new FileStatus[returns.size()]);
}
+ /**
+ * Filter any specific instants that we do not want to process.
+ * example timeline:
+ *
+ * t0 -> create bucket1.parquet
+ * t1 -> create and append updates bucket1.log
+ * t2 -> request compaction
+ * t3 -> create bucket2.parquet
+ *
+ * if compaction at t2 takes a long time, incremental readers on RO tables can move to t3 and would skip updates in t1
+ *
+ * To work around this problem, we want to stop returning data belonging to commits > t2.
+ * After compaction is complete, incremental reader would see updates in t2, t3, and so on.
+ */
+ protected HoodieDefaultTimeline filterInstantsTimeline(HoodieDefaultTimeline timeline) {
+ Option<HoodieInstant> pendingCompactionInstant = timeline.filterPendingCompactionTimeline().firstInstant();
+ if (pendingCompactionInstant.isPresent()) {
Review comment:
Once again, this is not a bug :) .. It's not supposed to be used this way.. I warned against this in the past as well .. anyways, water under the bridge..
I can see how this approach specifically helps your datasets... I am fine landing this change per se.. Let's introduce a jobConf variable to control this behavior? We can turn this off by default and you can ask the derived ETL to turn it on? (I am fine making it default to on also, your call)
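To make the suggestion concrete, here is a minimal, self-contained sketch of flag-gated instant filtering. The config key name and a plain `Map` standing in for Hadoop's `JobConf` are both assumptions for illustration; the real PR would pick its own key and wire it through `JobConf`.

```java
import java.util.*;
import java.util.stream.*;

public class PendingCompactionFilterSketch {
    // Hypothetical config key; the actual name would be decided in the PR.
    static final String FILTER_KEY = "hoodie.incremental.stop.at.pending.compaction";

    /**
     * Returns the commit instants an incremental reader may consume.
     * If the flag is on and a compaction is pending at instant tc, drop every
     * commit with time > tc so the reader cannot skate past un-compacted log data.
     */
    static List<String> filterInstants(Map<String, String> conf,
                                       List<String> commitTimes,
                                       Optional<String> pendingCompactionTime) {
        boolean enabled = Boolean.parseBoolean(conf.getOrDefault(FILTER_KEY, "false"));
        if (!enabled || !pendingCompactionTime.isPresent()) {
            return commitTimes; // behavior unchanged when the gate is off
        }
        String tc = pendingCompactionTime.get();
        // Instant times sort lexicographically in this toy model (t0 < t2 < t3).
        return commitTimes.stream()
                .filter(t -> t.compareTo(tc) <= 0)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Timeline from the javadoc: t0, t1 committed; compaction requested at t2; t3 committed.
        List<String> commits = Arrays.asList("t0", "t1", "t3");
        Map<String, String> conf = new HashMap<>();
        conf.put(FILTER_KEY, "true");
        System.out.println(filterInstants(conf, commits, Optional.of("t2"))); // t3 held back
        conf.put(FILTER_KEY, "false");
        System.out.println(filterInstants(conf, commits, Optional.of("t2"))); // gate off, t3 visible
    }
}
```

With the gate off, the reader would advance to t3 and the t1 log updates could be skipped, which is exactly the scenario the javadoc describes.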
To be on the same page, thinking out loud.. This change is orthogonal to the compaction strategy, right.. e.g., the compaction may not compact all the data in the log (let's say it only compacts the last N partitions); then the records from older partitions won't show up in the change stream until later... But what this fixes is preventing the incremental reader from advancing ahead when it sees a compaction... (The data skipping happens because compaction happens at a much later time and the incremental reader moves on to other delta commits, for e.g.)..
P.S: Another thing to double check is that the records written into the base file during compaction have the `_hoodie_commit_time` of the compaction and not the original write.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services