[ 
https://issues.apache.org/jira/browse/HBASE-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Stack updated HBASE-24558:
----------------------------------
    Description: 
Scenario: decent sized cluster (700 nodes) and WALs backing up (TBD: A 
Replication Peer was in place) such that many RSs had hundreds of WALs. Next, a 
power outage-like event where we have to cold-start. Master comes up and files 
SCPs for old server instances; hundreds. SCP finds there are hundreds of WALs 
and dutifully asks the wal-splitter to do its job. Meantime it goes to sleep 
until completion (See HBASE-24545); no status in log, just loads of complaint 
in log about STUCK procedure worker. Operator sees a stalled master dumping 
into its logs the occasional split completion and STUCK procedures but thats 
it. Operator tries poking up in zk under '/hbase/splitWAL' to see whats up but 
gets a spew back of one-line listing thousands of znodes. Operator gets no 
indication of how we are progressing or of when all is done.

In my particular case, there were > 100k WALs backed-up. I could dig around 
some and could see that 40k WALs were in-flight at any one time (necessitating 
upping jute.maxbuffer for zk).  I could do this:

{code}$ echo "ls /hbase/splitWAL" | hbase -Djute.maxbuffer=10000000 
org.apache.hadoop.hbase.zookeeper.ZKMainServer | sed -e 's/, /\n/g'|wc{code}

but it didn't help much as new WALs got added to the backlog. I could do the 
below but it didn't help when WAL count > 100k:

{code}$ hdfs dfs -lsr /hbase/WALs/*|wc{code}

Is the count here all the WALs that need splitting (yes, I think). Its less 
than zk count at any one time. How are they tied? Dunno.

This issue is about improving visibility in our distributed wal-splitter. In 
particular the zk-based splitter visibility is poor (HBASE-21588 is a new 
splitter built on procedures that probably has similar issue). We need to have 
a count of overall outstanding tasks, of how long tasks are taking, and of how 
well distributed the job is (see below).

An aside was that I had default for {code}HBASE_SPLIT_WAL_MAX_SPLITTER = 
"hbase.regionserver.wal.max.splitters"{code} which is {code}
public static final int DEFAULT_HBASE_SPLIT_WAL_MAX_SPLITTER = 2; {code} so in 
extremis, I upped the count to 10 but even then looking at random RS's they 
were only getting work on occasion sometimes with hours passing between split 
tasks as though the Master was bad at passing out the jobs: hard to tell when 
no view on what is going on.

Below is from log of random RS... notice how sporadically it gets split work 
though thousands of WALs to work on:

{code}
  89 2020-06-11 04:38:40,113 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@271eba8d
 in       12309ms. Status = DONE
  90 2020-06-11 04:38:51,445 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6c9e0b6
 in        13230ms. Status = DONE
  91 2020-06-11 04:38:57,060 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@7f102aeb
 in       16759ms. Status = DONE
  92 2020-06-11 09:17:12,762 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@28c31f7d
 in       1181ms. Status = DONE
  93 2020-06-11 09:17:14,512 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@3ecd4959
 in       2224ms. Status = DONE
  94 2020-06-11 09:17:14,750 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@5a72b3ad
 in       1345ms. Status = DONE
  95 2020-06-11 09:17:19,898 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6af0858d
 in       5319ms. Status = DONE
  96 2020-06-11 09:17:21,558 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@1bca0d8b
 in       890ms. Status = DONE
  97 2020-06-11 16:12:31,676 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@50a124bc
 in       1193ms. Status = DONE
  98 2020-06-11 16:12:34,140 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@519c681c
 in       2383ms. Status = DONE
  99 2020-06-11 16:12:35,943 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@4a47b062
 in       2278ms. Status = DONE
 100 2020-06-11 16:12:36,437 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@f9bcea
 in         1497ms. Status = DONE
 101 2020-06-11 16:12:39,953 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@42d82ac4
 in       914ms. Status = DONE
{code}

  was:
Scenario: decent sized cluster (700 nodes) and WALs backing up (TBD: A 
Replication Peer was in place) such that many RSs had hundreds of WALs. Next, a 
power outage-like event where we have to cold-start. Master comes up and files 
SCPs for old server instances; hundreds. SCP finds there are hundreds of WALs 
and dutifully asks the wal-splitter to do its job. Meantime it goes to sleep 
until completion (See HBASE-24545); no status in log, just loads of complaint 
in log about STUCK procedure worker. Operator sees a stalled master dumping 
into its logs the occasional split completion and STUCK procedures but thats 
it. Operator tries poking up in zk under '/hbase/splitWAL' to see whats up but 
gets a spew back of one-line listing thousands of znodes. Operator gets no 
indication of how we are progressing or of when all is done.

In my particular case, there were > 100k WALs backed-up. I could dig around 
some and could see that 40k WALs were in-flight at any one time (necessitating 
upping jute.maxbuffer for zk).  I could do this:

{code}$ echo "ls /hbase/splitWAL" | hbase -Djute.maxbuffer=10000000 
org.apache.hadoop.hbase.zookeeper.ZKMainServer | sed -e 's/, /\n/g'|wc{code}

but it didn't help much as new WALs got added to the backlog. I could do the 
below but it didn't help when WAL count > 100k:

{code}$ hdfs dfs -lsr /hbase/WALs/*|wc{code}

Is the count here all the WALs that need splitting (yes, I think). Its less 
than zk count at any one time. How are they tied? Dunno.

This issue is about improving visibility in our zk-based distributed 
wal-splitter. Needs to be count of overall outstanding tasks, of how long tasks 
are taking, and of how well distributed the job is (see below).

An aside was that I had default for {code}HBASE_SPLIT_WAL_MAX_SPLITTER = 
"hbase.regionserver.wal.max.splitters"{code} which is {code}
public static final int DEFAULT_HBASE_SPLIT_WAL_MAX_SPLITTER = 2; {code} so in 
extremis, I upped the count to 10 but even then looking at random RS's they 
were only getting work on occasion sometimes with hours passing between split 
tasks as though the Master was bad at passing out the jobs: hard to tell when 
no view on what is going on.

Below is from log of random RS... notice how sporadically it gets split work 
though thousands of WALs to work on:

{code}
  89 2020-06-11 04:38:40,113 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@271eba8d
 in       12309ms. Status = DONE
  90 2020-06-11 04:38:51,445 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6c9e0b6
 in        13230ms. Status = DONE
  91 2020-06-11 04:38:57,060 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@7f102aeb
 in       16759ms. Status = DONE
  92 2020-06-11 09:17:12,762 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@28c31f7d
 in       1181ms. Status = DONE
  93 2020-06-11 09:17:14,512 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@3ecd4959
 in       2224ms. Status = DONE
  94 2020-06-11 09:17:14,750 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@5a72b3ad
 in       1345ms. Status = DONE
  95 2020-06-11 09:17:19,898 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6af0858d
 in       5319ms. Status = DONE
  96 2020-06-11 09:17:21,558 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@1bca0d8b
 in       890ms. Status = DONE
  97 2020-06-11 16:12:31,676 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@50a124bc
 in       1193ms. Status = DONE
  98 2020-06-11 16:12:34,140 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@519c681c
 in       2383ms. Status = DONE
  99 2020-06-11 16:12:35,943 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@4a47b062
 in       2278ms. Status = DONE
 100 2020-06-11 16:12:36,437 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@f9bcea
 in         1497ms. Status = DONE
 101 2020-06-11 16:12:39,953 INFO  
[RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
Worker ps1406.example.org,16020,1591849310794 done with task 
org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@42d82ac4
 in       914ms. Status = DONE
{code}


> Classic (ZK-based) distributed WAL splitter is inscrutable in operation
> -----------------------------------------------------------------------
>
>                 Key: HBASE-24558
>                 URL: https://issues.apache.org/jira/browse/HBASE-24558
>             Project: HBase
>          Issue Type: Bug
>          Components: MTTR
>            Reporter: Michael Stack
>            Priority: Major
>
> Scenario: decent sized cluster (700 nodes) and WALs backing up (TBD: A 
> Replication Peer was in place) such that many RSs had hundreds of WALs. Next, 
> a power outage-like event where we have to cold-start. Master comes up and 
> files SCPs for old server instances; hundreds. SCP finds there are hundreds 
> of WALs and dutifully asks the wal-splitter to do its job. Meantime it goes 
> to sleep until completion (See HBASE-24545); no status in log, just loads of 
> complaint in log about STUCK procedure worker. Operator sees a stalled master 
> dumping into its logs the occasional split completion and STUCK procedures 
> but thats it. Operator tries poking up in zk under '/hbase/splitWAL' to see 
> whats up but gets a spew back of one-line listing thousands of znodes. 
> Operator gets no indication of how we are progressing or of when all is done.
> In my particular case, there were > 100k WALs backed-up. I could dig around 
> some and could see that 40k WALs were in-flight at any one time 
> (necessitating upping jute.maxbuffer for zk).  I could do this:
> {code}$ echo "ls /hbase/splitWAL" | hbase -Djute.maxbuffer=10000000 
> org.apache.hadoop.hbase.zookeeper.ZKMainServer | sed -e 's/, /\n/g'|wc{code}
> but it didn't help much as new WALs got added to the backlog. I could do the 
> below but it didn't help when WAL count > 100k:
> {code}$ hdfs dfs -lsr /hbase/WALs/*|wc{code}
> Is the count here all the WALs that need splitting (yes, I think). Its less 
> than zk count at any one time. How are they tied? Dunno.
> This issue is about improving visibility in our distributed wal-splitter. In 
> particular the zk-based splitter visibility is poor (HBASE-21588 is a new 
> splitter built on procedures that probably has similar issue). We need to 
> have a count of overall outstanding tasks, of how long tasks are taking, and 
> of how well distributed the job is (see below).
> An aside was that I had default for {code}HBASE_SPLIT_WAL_MAX_SPLITTER = 
> "hbase.regionserver.wal.max.splitters"{code} which is {code}
> public static final int DEFAULT_HBASE_SPLIT_WAL_MAX_SPLITTER = 2; {code} so 
> in extremis, I upped the count to 10 but even then looking at random RS's 
> they were only getting work on occasion sometimes with hours passing between 
> split tasks as though the Master was bad at passing out the jobs: hard to 
> tell when no view on what is going on.
> Below is from log of random RS... notice how sporadically it gets split work 
> though thousands of WALs to work on:
> {code}
>   89 2020-06-11 04:38:40,113 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
> Worker ps1406.example.org,16020,1591849310794 done with task 
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@271eba8d
>  in       12309ms. Status = DONE
>   90 2020-06-11 04:38:51,445 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
> Worker ps1406.example.org,16020,1591849310794 done with task 
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6c9e0b6
>  in        13230ms. Status = DONE
>   91 2020-06-11 04:38:57,060 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
> Worker ps1406.example.org,16020,1591849310794 done with task 
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@7f102aeb
>  in       16759ms. Status = DONE
>   92 2020-06-11 09:17:12,762 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
> Worker ps1406.example.org,16020,1591849310794 done with task 
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@28c31f7d
>  in       1181ms. Status = DONE
>   93 2020-06-11 09:17:14,512 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
> Worker ps1406.example.org,16020,1591849310794 done with task 
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@3ecd4959
>  in       2224ms. Status = DONE
>   94 2020-06-11 09:17:14,750 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
> Worker ps1406.example.org,16020,1591849310794 done with task 
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@5a72b3ad
>  in       1345ms. Status = DONE
>   95 2020-06-11 09:17:19,898 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
> Worker ps1406.example.org,16020,1591849310794 done with task 
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6af0858d
>  in       5319ms. Status = DONE
>   96 2020-06-11 09:17:21,558 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
> Worker ps1406.example.org,16020,1591849310794 done with task 
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@1bca0d8b
>  in       890ms. Status = DONE
>   97 2020-06-11 16:12:31,676 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
> Worker ps1406.example.org,16020,1591849310794 done with task 
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@50a124bc
>  in       1193ms. Status = DONE
>   98 2020-06-11 16:12:34,140 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
> Worker ps1406.example.org,16020,1591849310794 done with task 
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@519c681c
>  in       2383ms. Status = DONE
>   99 2020-06-11 16:12:35,943 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
> Worker ps1406.example.org,16020,1591849310794 done with task 
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@4a47b062
>  in       2278ms. Status = DONE
>  100 2020-06-11 16:12:36,437 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: 
> Worker ps1406.example.org,16020,1591849310794 done with task 
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@f9bcea
>  in         1497ms. Status = DONE
>  101 2020-06-11 16:12:39,953 INFO  
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: 
> Worker ps1406.example.org,16020,1591849310794 done with task 
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@42d82ac4
>  in       914ms. Status = DONE
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to