Michael Stack created HBASE-24558:
-------------------------------------

             Summary: Classic (ZK-based) distributed WAL splitter is 
inscrutable in operation
                 Key: HBASE-24558
                 URL: https://issues.apache.org/jira/browse/HBASE-24558
             Project: HBase
          Issue Type: Bug
          Components: MTTR
            Reporter: Michael Stack


Scenario: a decent-sized cluster (700 nodes) where WALs were backing up (TBD: a Replication Peer was in place) such that many RSs had hundreds of WALs. Next, a power-outage-like event where we had to cold-start. The Master comes up and files SCPs for the old server instances. Each SCP finds there are hundreds of WALs and dutifully asks the WAL splitter to do its job, then goes to sleep until completion (see HBASE-24545). The operator sees a stalled Master dumping the occasional split completion into its logs, but that's it. The operator tries poking around in zk under '/hbase/splitWAL' to see what's up, but gets back a single-line spew listing thousands of znodes. The operator gets no indication of how we are progressing or of when it will all be done.

In my particular case, there were > 100k WALs backed up. I could dig around some and see that 40k WALs were in-flight at any one time (necessitating upping jute.maxbuffer for zk). I could do this:

{code}
$ echo "ls /hbase/splitWAL" | hbase -Djute.maxbuffer=10000000 org.apache.hadoop.hbase.zookeeper.ZKMainServer | sed -e 's/, /\n/g' | wc
{code}

but it didn't help much since new WALs kept getting added to the backlog. I could also do the below, but it wasn't much use once the WAL count passed 100k:

{code}
$ hdfs dfs -lsr /hbase/WALs/* | wc
{code}
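
For the record, the same count can be had programmatically. Below is a minimal sketch only, assuming the default HDFS configuration is on the classpath and the standard /hbase/WALs root; like the shell one-liner above, it gets slow and unwieldy once the file count passes 100k:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class CountWALs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Recursively walk /hbase/WALs and count the files found.
    RemoteIterator<LocatedFileStatus> it = fs.listFiles(new Path("/hbase/WALs"), true);
    long count = 0;
    while (it.hasNext()) {
      it.next();
      count++;
    }
    System.out.println("WAL files under /hbase/WALs: " + count);
  }
}
{code}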

Is the count here all of the WALs that need splitting? (Yes, I think.) It's less than the zk count at any one time. How are they tied? Dunno.

This issue is about improving visibility into our zk-based distributed WAL splitter. There needs to be a count of the overall outstanding tasks, of how long tasks are taking, and of how well distributed the job is (see below).
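
In the meantime, something like the following gives an operator a rolling count of outstanding tasks. This is a minimal sketch only, using the plain ZooKeeper client against the /hbase/splitWAL znode; the quorum string is made up, and jute.maxbuffer has to be bumped client-side when tens of thousands of task znodes are outstanding. It covers only the outstanding-task count, not task durations or how work is distributed across workers:

{code}
import java.time.Instant;
import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class SplitWALWatcher {
  public static void main(String[] args) throws Exception {
    // Hypothetical quorum string; run with -Djute.maxbuffer=10000000 so the
    // getChildren response fits when there are tens of thousands of znodes.
    ZooKeeper zk = new ZooKeeper("zk1.example.org:2181", 30000, event -> { });
    while (true) {
      // Each child of /hbase/splitWAL is an outstanding or in-flight split task.
      List<String> tasks = zk.getChildren("/hbase/splitWAL", false);
      System.out.println(Instant.now() + " outstanding split tasks: " + tasks.size());
      Thread.sleep(60_000);
    }
  }
}
{code}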

An aside: I was running with the default for
{code}
HBASE_SPLIT_WAL_MAX_SPLITTER = "hbase.regionserver.wal.max.splitters"
{code}
which is
{code}
public static final int DEFAULT_HBASE_SPLIT_WAL_MAX_SPLITTER = 2;
{code}
In extremis I upped the count to 10, but even then, looking at random RSs, they were only getting split work on occasion, sometimes with hours passing between split tasks, as though the Master was bad at handing out the jobs: hard to tell when there is no view of what is going on.
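
For reference, bumping that knob looks like the below. A minimal sketch only; in practice the setting would normally go into hbase-site.xml on the region servers (followed by a restart) rather than be set in code:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BumpSplitters {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Ships as DEFAULT_HBASE_SPLIT_WAL_MAX_SPLITTER = 2; raised to 10 here so an
    // RS will take on up to 10 concurrent WAL-split tasks.
    conf.setInt("hbase.regionserver.wal.max.splitters", 10);
    System.out.println(conf.getInt("hbase.regionserver.wal.max.splitters", 2));
  }
}
{code}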

Below is from the log of a random RS... notice how sporadically it gets split work even though there are thousands of WALs to work on:

{code}
  89 2020-06-11 04:38:40,113 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@271eba8d in 12309ms. Status = DONE
  90 2020-06-11 04:38:51,445 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6c9e0b6 in 13230ms. Status = DONE
  91 2020-06-11 04:38:57,060 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@7f102aeb in 16759ms. Status = DONE
  92 2020-06-11 09:17:12,762 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@28c31f7d in 1181ms. Status = DONE
  93 2020-06-11 09:17:14,512 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@3ecd4959 in 2224ms. Status = DONE
  94 2020-06-11 09:17:14,750 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@5a72b3ad in 1345ms. Status = DONE
  95 2020-06-11 09:17:19,898 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6af0858d in 5319ms. Status = DONE
  96 2020-06-11 09:17:21,558 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@1bca0d8b in 890ms. Status = DONE
  97 2020-06-11 16:12:31,676 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@50a124bc in 1193ms. Status = DONE
  98 2020-06-11 16:12:34,140 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@519c681c in 2383ms. Status = DONE
  99 2020-06-11 16:12:35,943 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@4a47b062 in 2278ms. Status = DONE
 100 2020-06-11 16:12:36,437 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@f9bcea in 1497ms. Status = DONE
 101 2020-06-11 16:12:39,953 INFO  [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler: Worker ps1406.example.org,16020,1591849310794 done with task org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@42d82ac4 in 914ms. Status = DONE
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
