n3nash commented on a change in pull request #2188:
URL: https://github.com/apache/hudi/pull/2188#discussion_r514276728



##########
File path: 
hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/index/hbase/SparkHoodieHBaseIndex.java
##########
@@ -480,6 +486,61 @@ private Integer getNumRegionServersAliveForTable() {
   @Override
   public boolean rollbackCommit(String instantTime) {
     // Rollback in HbaseIndex is managed via method {@link #checkIfValidCommit()}
+    synchronized (SparkHoodieHBaseIndex.class) {

Review comment:
       @hj2016 The problem with firing deletes is that you end up putting extra 
load on the HBase cluster for every rollback, which is undesirable. The index 
and data should always end up in sync eventually; let me explain how:
   
   1) Say a batch of 4 records (uuid1, uuid2, uuid3, uuid4) was 
inserted into the index but the batch failed.
   2) Rollback will delete the data but leave the index as is.
   3) The next batch of 6 records (uuid1, uuid2, uuid3, uuid4, 
uuid5, uuid6) will now be retried. The HBase index entries for the first 4 records 
will be overwritten, so there won't be any dangling or remnant index entries. 
   The one way stale entries can linger is if you tried a batch of records, it failed, 
and you then skipped that batch entirely, which isn't a very common scenario.
   If there is another use case for which you want to delete the records from 
HBase, we can consider adding a config; let me know.
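   To make the argument concrete, here is a minimal sketch of the eventual-consistency reasoning above. All names here (`lookup`, `committedInstants`, `IndexRollbackSketch`) are illustrative stand-ins, not Hudi's actual API: the index maps record key to the instant of the last write, and `checkIfValidCommit` is modeled as membership in the set of successfully committed instants.

   ```java
   import java.util.HashMap;
   import java.util.HashSet;
   import java.util.Map;
   import java.util.Set;

   // Hypothetical sketch of why rollback can leave HBase index entries in place.
   public class IndexRollbackSketch {
       static Map<String, String> index = new HashMap<>();     // record key -> instant
       static Set<String> committedInstants = new HashSet<>(); // successful commits only

       // A lookup only trusts index entries whose instant actually committed,
       // mirroring the {@code checkIfValidCommit()} filtering described above.
       static String lookup(String key) {
           String instant = index.get(key);
           return (instant != null && committedInstants.contains(instant)) ? instant : null;
       }

       public static void main(String[] args) {
           // 1) A batch at instant t1 writes 4 keys to the index, then the commit fails.
           for (String k : new String[]{"uuid1", "uuid2", "uuid3", "uuid4"}) {
               index.put(k, "t1");
           }
           // 2) Rollback deletes the data but leaves the index entries; t1 is never
           // added to committedInstants, so lookups ignore the stale entries.
           System.out.println("after failed t1, lookup(uuid1) = " + lookup("uuid1")); // null

           // 3) The retried batch at t2 covers the same keys plus two new ones and commits.
           for (String k : new String[]{"uuid1", "uuid2", "uuid3", "uuid4", "uuid5", "uuid6"}) {
               index.put(k, "t2");
           }
           committedInstants.add("t2");
           System.out.println("after committed t2, lookup(uuid1) = " + lookup("uuid1")); // t2
           // Every stale t1 mapping was overwritten, so no dangling entries remain
           // and no deletes ever had to be issued against the index.
       }
   }
   ```

   The same logic shows the one gap mentioned above: if the failed batch is never retried, the stale t1 entries persist in storage, but they stay invisible to lookups because t1 never commits.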




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

