[ https://issues.apache.org/jira/browse/HIVE-27976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Taraka Rama Rao Lethavadla updated HIVE-27976:
----------------------------------------------
    Description: 
removeDuplicateCompletedTxnComponents seems to take more time on busy clusters, 
where the number of entries in COMPLETED_TXN_COMPONENTS grows at a higher rate.

Copying discussion from 
[HIVE-27022|https://github.com/apache/hive/pull/4970#issuecomment-1875219288]

What about considering another approach? What if, instead of running the cleanup 
in one large transaction, we run multiple small ones?

For example, on MSSQL we took measurements back then and found that when deleting 
a large number of records, it is much faster to do it in batches of fewer than 
5000 rows (we usually used 4000).

The reason it was faster is the database's locking mechanism: for a large number 
of records it takes exclusive locks, so the table cannot be used by other 
processes, which can cause performance issues.

The cleanup in that case is a bit more complicated: it can be slow either because 
of the time it takes to delete the records, or because of the time it takes to 
collect the records we want to delete.

For those kinds of scenarios I would recommend two parameters for the cleanup:
 * Batch size
 * Number of iterations
And I would still keep a 1-minute interval as the default.

That way the parameters are easy to fine-tune for customers: if there are too 
many records to delete, just increase the number of iterations. If it takes too 
long to collect what to delete, increase the time window and/or the batch size.
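
To make the idea concrete, here is a minimal sketch of how such a batched cleanup 
could look from the metastore side. It is only illustrative: the class name and 
the two knobs are made up, the per-batch cap uses Oracle-style ROWNUM, and the 
CTC_UPDATE_DELETE condition from the real query is left out for brevity.
{code:java}
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

/**
 * Illustrative sketch only: removes duplicate COMPLETED_TXN_COMPONENTS rows in
 * many small transactions instead of one big one. Not the actual TxnHandler code.
 */
public class BatchedCompletedTxnCleanup {

  private final int batchSize;      // proposed "batch size" knob, e.g. 4000
  private final int maxIterations;  // proposed "number of iterations" knob

  public BatchedCompletedTxnCleanup(int batchSize, int maxIterations) {
    this.batchSize = batchSize;
    this.maxIterations = maxIterations;
  }

  /** Deletes duplicates batch by batch, committing after each one; returns rows removed. */
  public int removeDuplicatesInBatches(Connection conn) throws SQLException {
    // A row is a duplicate when the same database/table/partition has an entry
    // with a higher write id; the batch is capped with ROWNUM (Oracle syntax,
    // other databases would need LIMIT/TOP instead).
    String sql =
        "DELETE FROM COMPLETED_TXN_COMPONENTS tc"
      + " WHERE EXISTS (SELECT 1 FROM COMPLETED_TXN_COMPONENTS c"
      + "   WHERE c.CTC_DATABASE = tc.CTC_DATABASE"
      + "     AND c.CTC_TABLE = tc.CTC_TABLE"
      + "     AND (c.CTC_PARTITION = tc.CTC_PARTITION"
      + "          OR (c.CTC_PARTITION IS NULL AND tc.CTC_PARTITION IS NULL))"
      + "     AND tc.CTC_WRITEID < c.CTC_WRITEID)"
      + "   AND ROWNUM <= " + batchSize;

    int totalDeleted = 0;
    boolean oldAutoCommit = conn.getAutoCommit();
    conn.setAutoCommit(false);
    try {
      for (int i = 0; i < maxIterations; i++) {
        try (Statement stmt = conn.createStatement()) {
          int deleted = stmt.executeUpdate(sql);
          conn.commit();                 // keep every transaction small
          totalDeleted += deleted;
          if (deleted < batchSize) {     // last partial batch: nothing more to clean
            break;
          }
        }
      }
    } finally {
      conn.setAutoCommit(oldAutoCommit);
    }
    return totalDeleted;
  }
}
{code}
Committing after every batch keeps each transaction (and therefore its lock 
footprint) small, and the loop stops early once a batch comes back shorter than 
the batch size.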

Another thought:

How about modifying the query to something like this:
{noformat}
DELETE FROM "completed_txn_components" "tc"
WHERE  rowid IN (SELECT :"SYS_B_0"
                 FROM   "completed_txn_components"
                 WHERE  "ctc_database" = "tc"."ctc_database"
                        AND "ctc_table" = "tc"."ctc_table"
                        AND ( "ctc_partition" = "tc"."ctc_partition"
                              OR ( "ctc_partition" IS NULL
                                   AND "tc"."ctc_partition" IS NULL ) )
                        AND ( "tc"."ctc_update_delete" = :"SYS_B_1"
                              OR "tc"."ctc_update_delete" = :"SYS_B_2"
                                 AND "ctc_update_delete" = :"SYS_B_3" )
                        AND "tc"."ctc_writeid" < "ctc_writeid")
{noformat}
Or

How about running this cleanup query for the entries of a single table/partition 
as part of the Cleaner itself, so that the overall load on the housekeeping task 
gets reduced? A rough sketch of that idea follows below.
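
For reference, here is what that per-table/partition cleanup could look like if 
the Cleaner triggered it. The class and method names are hypothetical (not 
existing Hive APIs), and the CTC_UPDATE_DELETE condition is again omitted for 
brevity.
{code:java}
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/**
 * Sketch of the Cleaner-side idea: once the Cleaner has finished a table or
 * partition, drop the superseded COMPLETED_TXN_COMPONENTS rows for just that
 * scope. Class and method names are illustrative, not existing Hive APIs.
 */
public class PerPartitionTxnComponentsCleanup {

  /** Deletes rows of the given db/table/partition that a newer write id supersedes. */
  public int cleanForPartition(Connection conn, String db, String table,
                               String partition) throws SQLException {
    String sql =
        "DELETE FROM COMPLETED_TXN_COMPONENTS tc"
      + " WHERE tc.CTC_DATABASE = ?"
      + "   AND tc.CTC_TABLE = ?"
      + "   AND (tc.CTC_PARTITION = ? OR (? IS NULL AND tc.CTC_PARTITION IS NULL))"
      + "   AND EXISTS (SELECT 1 FROM COMPLETED_TXN_COMPONENTS c"
      + "     WHERE c.CTC_DATABASE = tc.CTC_DATABASE"
      + "       AND c.CTC_TABLE = tc.CTC_TABLE"
      + "       AND (c.CTC_PARTITION = tc.CTC_PARTITION"
      + "            OR (c.CTC_PARTITION IS NULL AND tc.CTC_PARTITION IS NULL))"
      + "       AND tc.CTC_WRITEID < c.CTC_WRITEID)";

    try (PreparedStatement ps = conn.prepareStatement(sql)) {
      ps.setString(1, db);
      ps.setString(2, table);
      ps.setString(3, partition);   // null for unpartitioned tables
      ps.setString(4, partition);
      return ps.executeUpdate();
    }
  }
}
{code}
Because the delete is scoped to a single table/partition, each statement touches 
far fewer rows, so the housekeeping task would not have to scan the whole table 
in one go.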


> Improve logic/query to clean COMPLETED_TXN_COMPONENTS table
> -----------------------------------------------------------
>
>                 Key: HIVE-27976
>                 URL: https://issues.apache.org/jira/browse/HIVE-27976
>             Project: Hive
>          Issue Type: Improvement
>          Components: Hive
>            Reporter: Taraka Rama Rao Lethavadla
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
