[
https://issues.apache.org/jira/browse/HBASE-29109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Prathyusha updated HBASE-29109:
-------------------------------
Description:
{{SnapshotManager does *prepare* and *process* as below}}
{quote}{{{}{color:#6a3e3e}handler{color}{color:#000000}.prepare();
{color}{}}}{color:#7f0055}this{color}{color:#000000}.{color}{color:#0000c0}executorService{color}{color:#000000}.submit({color}{color:#6a3e3e}handler{color}{color:#000000});
(process called here){color}
{quote}
For Enabled tables prepare() takes an EXCLUSIVE lock with a timeout
and right when it starts processing it releases the current EXCLUSIVE lock on
the table and takes a SHARED one instead
{{{color:#000000}
{color}}}
{quote}{color:#000000} {color}{color:#7f0055}try{color}{color:#000000} {{color}
{color:#000000} {color}{color:#7f0055}if{color}{color:#000000}
(downgradeToSharedTableLock()) {{color}
{color:#000000} {color}{color:#3f7f5f}// release the exclusive lock and hold
the shared lock instead{color}
{color:#000000} {color}{color:#6a3e3e}tableLockToRelease{color}{color:#000000}
=
{color}{color:#0000c0}master{color}{color:#000000}.getLockManager().createMasterLock({color}{color:#0000c0}snapshotTable{color}{color:#000000},{color}
{color:#000000} LockType.{color}{color:#0000c0}SHARED{color}{color:#000000},
{color}{color:#7f0055}this{color}{color:#000000}.getClass().getName() +
{color}{color:#2a00ff}": take snapshot "{color}{color:#000000} +
{color}{color:#0000c0}snapshot{color}{color:#000000}.getName());{color}
{color:#000000}
{color}{color:#0000c0}tableLock{color}{color:#000000}.release();{color}
{color:#000000} {color}{color:#7f0055}boolean{color}{color:#000000}
{color}{color:#6a3e3e}isTableLockAcquired{color}{color:#000000} =
{color}{color:#6a3e3e}tableLockToRelease{color}{color:#000000}.tryAcquire({color}{color:#7f0055}this{color}{color:#000000}.{color}{color:#0000c0}lockAcquireTimeoutMs{color}{color:#000000});{color}
{color:#000000} {color}{color:#7f0055}if{color}{color:#000000}
(!{color}{color:#6a3e3e}isTableLockAcquired{color}{color:#000000}) {{color}
{color:#000000}
{color}{color:#0000c0}LOG{color}{color:#000000}.error({color}{color:#2a00ff}"Could
not acquire shared lock on table {} in {} ms"{color}{color:#000000},
{color}{color:#0000c0}snapshotTable{color}{color:#000000},{color}
{color:#000000}
{color}{color:#0000c0}lockAcquireTimeoutMs{color}{color:#000000});{color}
{color:#000000} {color}{color:#7f0055}throw{color}{color:#000000}
{color}{color:#7f0055}new{color}{color:#000000}
IOException({color}{color:#2a00ff}"Could not acquire shared lock on table
"{color}{color:#000000} +
{color}{color:#0000c0}snapshotTable{color}{color:#000000});{color}
{color:#000000} }{color}
{color:#000000} }{color}
{quote}
Since the process state runs in ExecutorService sometimes there is a delay
and Snapshot hold the EXCLUSIVE too long, blocking the ASSIGN child procedures
of SCP
Below are the logs we saw for the same in our production
Process stage of Snapshot started
{quote}2025-01-15 02:15:09,608 INFO
[ER_SNAPSHOT_OPERATIONS-master/hmaster-3:60000-5360]
snapshot.TakeSnapshotHandler - Running SKIPFLUSH table snapshot
HGRATE_FEEDS.FEED_COMPOSITION_1736902854385_1736906318965_0 C_M_SNAPSHOT_TABLE
on table <TABLE_NAME>
{quote}
then the EXCLUSIVE lock is released
{quote}2025-01-15 02:15:09,615 INFO [PEWorker-20] procedure2.ProcedureExecutor
- Finished pid=26291867, state=SUCCESS;
org.apache.hadoop.hbase.master.locking.LockProcedure, tableName=<TABLE_NAME>,
type=EXCLUSIVE in 9 mins, 55.422 sec
{quote}
ASSIGN of SCP was able to then get the lock
{quote}2025-01-15 02:15:09,615 INFO [PEWorker-20]
procedure.MasterProcedureScheduler - Took xlock for pid=26292238,
ppid=26291869, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
TransitRegionStateProcedure table=<TABLE_NAME>,
region=e9d104da353c65725b33f902c8a1449d, ASSIGN
{quote}
Though once lock is acquired ASSIGN was pretty quick, total time of that region
availability shoot up to 9 mins
{quote}2025-01-15 02:15:10,384 INFO [PEWorker-24] procedure2.ProcedureExecutor
- Finished pid=26292238, ppid=26291869, state=SUCCESS;
TransitRegionStateProcedure table=<TABLE_NAME>,
region=e9d104da353c65725b33f902c8a1449d, ASSIGN in 9 mins, 1.364 sec
{quote}
> Process stage of TakeSnapshotHandler runs Async which is holding the
> EXCLUSIVE Lock of Table for around 9 mins blocking ASSIGN procedures of SCP
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-29109
> URL: https://issues.apache.org/jira/browse/HBASE-29109
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 2.5.8
> Reporter: Prathyusha
> Assignee: Prathyusha
> Priority: Major
>
> {{SnapshotManager does *prepare* and *process* as below}}
> {quote}{{{}{color:#6a3e3e}handler{color}{color:#000000}.prepare();
> {color}{}}}{color:#7f0055}this{color}{color:#000000}.{color}{color:#0000c0}executorService{color}{color:#000000}.submit({color}{color:#6a3e3e}handler{color}{color:#000000});
> (process called here){color}
> {quote}
> For Enabled tables prepare() takes an EXCLUSIVE lock with a timeout
> and right when it starts processing it releases the current EXCLUSIVE lock on
> the table and takes a SHARED one instead
> {{{color:#000000}
> {color}}}
> {quote}{color:#000000} {color}{color:#7f0055}try{color}{color:#000000}
> {{color}
> {color:#000000} {color}{color:#7f0055}if{color}{color:#000000}
> (downgradeToSharedTableLock()) {{color}
> {color:#000000} {color}{color:#3f7f5f}// release the exclusive lock and hold
> the shared lock instead{color}
> {color:#000000}
> {color}{color:#6a3e3e}tableLockToRelease{color}{color:#000000} =
> {color}{color:#0000c0}master{color}{color:#000000}.getLockManager().createMasterLock({color}{color:#0000c0}snapshotTable{color}{color:#000000},{color}
> {color:#000000} LockType.{color}{color:#0000c0}SHARED{color}{color:#000000},
> {color}{color:#7f0055}this{color}{color:#000000}.getClass().getName() +
> {color}{color:#2a00ff}": take snapshot "{color}{color:#000000} +
> {color}{color:#0000c0}snapshot{color}{color:#000000}.getName());{color}
> {color:#000000}
> {color}{color:#0000c0}tableLock{color}{color:#000000}.release();{color}
> {color:#000000} {color}{color:#7f0055}boolean{color}{color:#000000}
> {color}{color:#6a3e3e}isTableLockAcquired{color}{color:#000000} =
> {color}{color:#6a3e3e}tableLockToRelease{color}{color:#000000}.tryAcquire({color}{color:#7f0055}this{color}{color:#000000}.{color}{color:#0000c0}lockAcquireTimeoutMs{color}{color:#000000});{color}
> {color:#000000} {color}{color:#7f0055}if{color}{color:#000000}
> (!{color}{color:#6a3e3e}isTableLockAcquired{color}{color:#000000}) {{color}
> {color:#000000}
> {color}{color:#0000c0}LOG{color}{color:#000000}.error({color}{color:#2a00ff}"Could
> not acquire shared lock on table {} in {} ms"{color}{color:#000000},
> {color}{color:#0000c0}snapshotTable{color}{color:#000000},{color}
> {color:#000000}
> {color}{color:#0000c0}lockAcquireTimeoutMs{color}{color:#000000});{color}
> {color:#000000} {color}{color:#7f0055}throw{color}{color:#000000}
> {color}{color:#7f0055}new{color}{color:#000000}
> IOException({color}{color:#2a00ff}"Could not acquire shared lock on table
> "{color}{color:#000000} +
> {color}{color:#0000c0}snapshotTable{color}{color:#000000});{color}
> {color:#000000} }{color}
> {color:#000000} }{color}
> {quote}
> Since the process state runs in ExecutorService sometimes there is a delay
> and Snapshot hold the EXCLUSIVE too long, blocking the ASSIGN child
> procedures of SCP
> Below are the logs we saw for the same in our production
> Process stage of Snapshot started
> {quote}2025-01-15 02:15:09,608 INFO
> [ER_SNAPSHOT_OPERATIONS-master/hmaster-3:60000-5360]
> snapshot.TakeSnapshotHandler - Running SKIPFLUSH table snapshot
> HGRATE_FEEDS.FEED_COMPOSITION_1736902854385_1736906318965_0
> C_M_SNAPSHOT_TABLE on table <TABLE_NAME>
> {quote}
> then the EXCLUSIVE lock is released
> {quote}2025-01-15 02:15:09,615 INFO [PEWorker-20]
> procedure2.ProcedureExecutor - Finished pid=26291867, state=SUCCESS;
> org.apache.hadoop.hbase.master.locking.LockProcedure, tableName=<TABLE_NAME>,
> type=EXCLUSIVE in 9 mins, 55.422 sec
> {quote}
> ASSIGN of SCP was able to then get the lock
> {quote}2025-01-15 02:15:09,615 INFO [PEWorker-20]
> procedure.MasterProcedureScheduler - Took xlock for pid=26292238,
> ppid=26291869, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=<TABLE_NAME>,
> region=e9d104da353c65725b33f902c8a1449d, ASSIGN
> {quote}
> Though once lock is acquired ASSIGN was pretty quick, total time of that
> region availability shoot up to 9 mins
> {quote}2025-01-15 02:15:10,384 INFO [PEWorker-24]
> procedure2.ProcedureExecutor - Finished pid=26292238, ppid=26291869,
> state=SUCCESS; TransitRegionStateProcedure table=<TABLE_NAME>,
> region=e9d104da353c65725b33f902c8a1449d, ASSIGN in 9 mins, 1.364 sec
> {quote}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)