Re: [PR] HIVE-28578: Fix concurrency issue in ObjectStore#updateTableColumnStatistics [hive]

via GitHub Tue, 03 Dec 2024 00:48:47 -0800


InvisibleProgrammer commented on code in PR #5567:
URL: https://github.com/apache/hive/pull/5567#discussion_r1867281148



##########
standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/ObjectStore.java:
##########
@@ -10296,18 +10297,23 @@ public Map<String, String> updateTableColumnStatistics(ColumnStatistics colStats
       }
 
       // TODO: (HIVE-20109) ideally the col stats stats should be in colstats, not in the table!
-      // Set the table properties
-      // No need to check again if it exists.
-      String dbname = table.getDbName();
-      String name = table.getTableName();
-      MTable oldt = mTable;
       Map<String, String> newParams = new HashMap<>(table.getParameters());
-      StatsSetupConst.setColumnStatsState(newParams, colNames);
-      boolean isTxn = TxnUtils.isTransactionalTable(oldt.getParameters());
-      if (isTxn) {
-        if (!areTxnStatsSupported) {
-          StatsSetupConst.setBasicStatsState(newParams, StatsSetupConst.FALSE);
-        } else {
+
+      int retries = 3;
+      boolean success = false;
+      while (!success && retries > 0) {

Review Comment:
   No. 
   
   ```
   Summary:
   updateTableColumnStatistics can throw SQLIntegrityConstraintViolationException during replication if HA is on and two different HMS instances get the same call but with a different engine.

   Workaround:
   Update table column statistics from a single thread.

   Details:
   updateTableColumnStatistics runs a relatively long transaction. In that transaction, it validates the actual parameters, queries the metastore db for the TABLE_PARAMS that are already stored, and makes a decision based on that. After this, it uses DataNucleus to persist the new statistics.
   Of the two HMS instances, one can save the column statistics. The other cannot, as the first instance has already saved them.
   ```
   
   The point is that both process A and process B decide to store the new data. At the db level, that is an insert. Process A commits the insert. Process B fails with a constraint violation because the row already exists. If we retry process B, it queries the current state of the statistics again, so this time it decides to do an update instead of an insert.
   Unfortunately, DataNucleus has no such thing as an upsert; it would be much easier that way...
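
   To make the retry argument concrete, here is a minimal, self-contained sketch of the read-decide-write pattern with a retry. It is not the ObjectStore code and does not use DataNucleus: `StatsStore`, `DuplicateKeyException`, `save` and the `betweenReadAndWrite` hook are hypothetical names for illustration only, with an in-memory map standing in for the table and its unique constraint. The hook simulates process A committing its insert after process B has already decided to insert.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Illustrative only: all names here are hypothetical, not Hive or DataNucleus APIs. */
class UpsertRetrySketch {

  /** Stand-in for the SQLIntegrityConstraintViolationException seen at the db level. */
  static class DuplicateKeyException extends RuntimeException {
    DuplicateKeyException(String key) {
      super("duplicate key: " + key);
    }
  }

  /** In-memory stand-in for a table with a unique constraint on the key. */
  static class StatsStore {
    private final Map<String, String> rows = new ConcurrentHashMap<>();

    boolean exists(String key) {
      return rows.containsKey(key);
    }

    void insert(String key, String value) {
      // putIfAbsent returns the existing value when the key is already present.
      if (rows.putIfAbsent(key, value) != null) {
        throw new DuplicateKeyException(key);
      }
    }

    void update(String key, String value) {
      rows.put(key, value);
    }

    String get(String key) {
      return rows.get(key);
    }
  }

  /**
   * Reads the current state, decides between insert and update, then writes.
   * On a duplicate-key failure it starts over, so the next attempt sees the
   * row written by the other process. Returns the number of attempts used.
   */
  static int save(StatsStore store, String key, String value,
                  int maxAttempts, Runnable betweenReadAndWrite) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      boolean present = store.exists(key); // decision is re-made on every attempt
      betweenReadAndWrite.run();           // test hook: lets another writer sneak in
      try {
        if (present) {
          store.update(key, value);
        } else {
          store.insert(key, value);
        }
        return attempt;
      } catch (DuplicateKeyException e) {
        // Another writer inserted the row after our read; retry from the read.
      }
    }
    throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
  }

  public static void main(String[] args) {
    StatsStore store = new StatsStore();
    boolean[] fired = {false};
    // "Process A" inserts exactly once, right after "process B" has read the state.
    Runnable processA = () -> {
      if (!fired[0]) {
        fired[0] = true;
        store.insert("t1.col1", "from A");
      }
    };
    int attempts = save(store, "t1.col1", "from B", 3, processA);
    System.out.println(attempts + " attempts, value = " + store.get("t1.col1"));
  }
}
```

   In this sketch the first attempt of process B fails on the insert, and the second attempt sees the row and takes the update path. That is the behaviour the retry loop in the PR relies on.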
   


