InvisibleProgrammer commented on PR #4970:
URL: https://github.com/apache/hive/pull/4970#issuecomment-1875219288

   Hi, let me share some thoughts about the change here: 
   
   Firstly, I like the idea of decoupling the AcidHouseKeeperService. As I see it, it already does five different things, so this change is a small step towards one object doing one thing:
   ```java
       performTask(txnHandler::performTimeOuts, "Cleaning timed out txns and locks");
       performTask(txnHandler::performWriteSetGC, "Cleaning obsolete write set entries");
       performTask(txnHandler::cleanTxnToWriteIdTable, "Cleaning obsolete TXN_TO_WRITE_ID entries");
       if (isCompactorEnabled) {
         performTask(txnHandler::removeDuplicateCompletedTxnComponents, "Cleaning duplicate COMPLETED_TXN_COMPONENTS entries");
         performTask(txnHandler::purgeCompactionHistory, "Cleaning obsolete compaction history entries");
       }
   ```
   
   However, I have some second thoughts about how the performance issue is solved: the problem is a cleanup in the database that takes too long to run, and the proposed solution is to increase the time window. What concerns me is that this mostly pushes the problem further away. Currently we run the cleanup every minute; after this change, the duplicated txn components cleanup will run every 5 minutes, and that will be the default. 
   
   That can lead to longer intervals of table locks on the tables we use. What if it leads to deadlocks? With relational databases, the best practice is to keep transactions short. 
   
   What about considering another approach? What if, instead of running the cleanup in one large transaction, we run multiple small ones? 
   I don't know what kind of database the customer has, but for example, on MSSQL we did measurements back then and found that when we want to delete a large number of records, it is much faster to do it in batches of fewer than 5000 elements (we usually used 4000). The reason it was faster was the locking mechanism of the database: for a large number of records it takes exclusive locks, so the table cannot be used by other processes, which can cause performance issues. 
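   
   To illustrate, here is a minimal sketch of such a batched delete in plain JDBC. The `DELETE TOP` syntax is MSSQL-specific, and the table/column names and the class itself are only assumptions for the example, not the statement Hive actually uses:
   ```java
   import java.sql.Connection;
   import java.sql.PreparedStatement;
   import java.sql.SQLException;
   
   // Sketch only: deletes in small batches, each batch being its own short transaction.
   public class BatchedCleanupSketch {
   
     private static final int BATCH_SIZE = 4000; // assumed batch size, below the ~5000 threshold
   
     public static void deleteInBatches(Connection conn, long maxTxnId) throws SQLException {
       // Table and column names are assumptions for illustration.
       String sql = "DELETE TOP (" + BATCH_SIZE + ") FROM COMPLETED_TXN_COMPONENTS WHERE CTC_TXNID < ?";
       int deleted;
       do {
         try (PreparedStatement ps = conn.prepareStatement(sql)) {
           ps.setLong(1, maxTxnId);
           deleted = ps.executeUpdate(); // with auto-commit, each batch commits separately
         }
       } while (deleted == BATCH_SIZE);  // keep going while full batches are still being removed
     }
   }
   ```
   This keeps each individual transaction short, so the exclusive locks are held only briefly and other processes can interleave between batches. 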
   
   The cleanup in this case is a bit more complicated: it can be slow because of the time it takes to delete the records, or because of the time it takes to collect the records that we want to delete. 
   
   For those kinds of scenarios I would recommend having two parameters for the cleanup: 
   - Batch size
   - Number of iterations
   
   And I would still keep the 1 minute interval as the default (see the sketch after this list). 
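   
   A rough sketch of how those two parameters could bound the work done per run (the parameter names and the BatchDeleter helper are hypothetical, not existing Hive configuration or code):
   ```java
   // Sketch only: caps the number of short delete transactions per housekeeping run.
   public class BoundedCleanupSketch {
   
     interface BatchDeleter {
       /** Deletes at most batchSize rows in one short transaction and returns the rows deleted. */
       int deleteOneBatch(int batchSize) throws Exception;
     }
   
     static void runCleanup(BatchDeleter deleter, int batchSize, int maxIterations) throws Exception {
       for (int i = 0; i < maxIterations; i++) {
         int deleted = deleter.deleteOneBatch(batchSize);
         if (deleted < batchSize) {
           break; // backlog drained, stop early
         }
       }
       // Anything left over after maxIterations batches is picked up by the next 1-minute run.
     }
   }
   ```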
   
   That way, the parameters can easily be fine-tuned for the customers: if there are too many records to delete, just increase the number of iterations. If it takes too long to collect what to delete, increase the time window and/or the batch size. 
   
   What do you think about that?  

