[
https://issues.apache.org/jira/browse/HBASE-24558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136023#comment-17136023
]
Michael Stack commented on HBASE-24558:
---------------------------------------
[~anoop.hbase]
bq. I saw that but we will do that threads addition only when we see existing
are stuck!
Yes. In this case the SCPs are 'stuck' waiting on split logs to complete --
they block -- so we spin up new threads, and so on.
bq. So many RS split thread trying to read WAL files concurrently and trying
to write so many files.
Our HDFS access is throttled. By default we only allow two splitters to run per
RS. In my case, because the split throughput was poor -- about 20 splits a
minute -- I tried upping the splitters per RS to 10... but it didn't help; we
seem to be bad at distributing the work across the cluster (hard to tell
where the bottleneck is).
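For anyone following along, the bump was just raising
hbase.regionserver.wal.max.splitters on the RS side; a sketch of the
hbase-site.xml change (I believe it is read at RS startup, so needs a restart
to take effect; verify for your version):
{code}
<!-- sketch: raise splitter threads per RS (I used 10); I believe this is read at RS startup -->
<property>
  <name>hbase.regionserver.wal.max.splitters</name>
  <value>10</value>
</property>
{code}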
While our HDFS access is throttled, the SCP wait on wal-splitting completion
involves querying zk, reading the content of the '/hbase/splitWAL' dir every
100ms; 160 threads each listing a dir of 40k znodes was making the zk servers
work hard (HBASE-24545 added backoff when there are lots of outstanding WALs to
split).
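As an aside, an operator can get a rough count of the outstanding split tasks
without pulling the whole child listing (and without touching jute.maxbuffer)
by asking zk for the znode's stat and reading numChildren; an untested sketch
using the same ZKMainServer entry point as in the description below:
{code}
# untested: 'stat' should report numChildren without listing every child
$ echo "stat /hbase/splitWAL" | hbase org.apache.hadoop.hbase.zookeeper.ZKMainServer 2>/dev/null | grep numChildren
{code}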
[~zghao] thanks. I'm trying HBASE-21588 now. It looks great. No blocking. While
it also has poor visibility, it is a better foundation to build on given it
uses our general procedure engine; any improvement in procedure insight will be
generally beneficial (and we at least have our procedures&locks UI and
command-line utils to list outstanding procedures...).
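For the record, the command-line route is just the shell commands; something
like the below works here (a sketch; the output format varies by version):
{code}
# non-interactive shell; list_procedures/list_locks are in recent 2.x shells
$ echo "list_procedures" | hbase shell -n
$ echo "list_locks" | hbase shell -n
{code}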
[~ram_krish] thanks. I forgot about the config which says wait longer before
preempting. Now you mention it, I was seeing a bunch of reassigning of tasks
(the WALs here were also large): hbase.splitlog.manager.timeout, whose default
is two minutes. I think we should up our default here -- 5 minutes? For
example, picking a random server, the longest split took over a minute even
though the WAL was small ... 256M ... but it split to 80 Regions, which
probably took time. Looking at another server, 70 seconds. We should only
preempt in extreme cases.
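Concretely, something like the below on the Master side (a sketch; the value is
in ms, so 300000 is 5 minutes):
{code}
<!-- sketch: wait longer before preempting a split task -->
<property>
  <name>hbase.splitlog.manager.timeout</name>
  <value>300000</value>
</property>
{code}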
[~pankajkumar]
bq. Currently we are working on reducing the HDFS call in WAL split and making
WAL split and region assignment async in SCP.
Interesting. Is your work based on current splitter or are you working on top
of HBASE-21588?
bq. May be because SplitLogManager submit the SplitTask in async way, if ZK is
bottleneck then this will also cause delay in WAL split submit & completion.
Yes. It was hard to tell where the bottleneck was but, for sure, the available
resources were not being fully utilized.
Your review of NN load sounds interesting. Did you publish it anywhere sir?
Thanks.
> Classic (ZK-based) distributed WAL splitter is inscrutable in operation
> -----------------------------------------------------------------------
>
> Key: HBASE-24558
> URL: https://issues.apache.org/jira/browse/HBASE-24558
> Project: HBase
> Issue Type: Bug
> Components: MTTR
> Reporter: Michael Stack
> Priority: Major
>
> Scenario: decent-sized cluster (700 nodes) and WALs backing up (TBD: A
> Replication Peer was in place) such that many RSs had hundreds of WALs. Next,
> a power outage-like event where we have to cold-start. Master comes up and
> files SCPs for the old server instances; hundreds. SCP finds there are
> hundreds of WALs and dutifully asks the wal-splitter to do its job. Meantime
> it goes to sleep until completion (see HBASE-24545); no status in the log,
> just loads of complaints about STUCK procedure workers. Operator sees a
> stalled master dumping into its logs the occasional split completion and
> STUCK procedures but that's it. Operator tries poking around in zk under
> '/hbase/splitWAL' to see what's up but gets back a spew of a one-line listing
> of thousands of znodes. Operator gets no indication of how we are progressing
> or of when all is done.
> In my particular case, there were > 100k WALs backed-up. I could dig around
> some and could see that 40k WALs were in-flight at any one time
> (necessitating upping jute.maxbuffer for zk). I could do this:
> {code}$ echo "ls /hbase/splitWAL" | hbase -Djute.maxbuffer=10000000
> org.apache.hadoop.hbase.zookeeper.ZKMainServer | sed -e 's/, /\n/g'|wc{code}
> but it didn't help much as new WALs got added to the backlog. I could do the
> below but it didn't help when WAL count > 100k:
> {code}$ hdfs dfs -lsr /hbase/WALs/*|wc{code}
> Is the count here all the WALs that need splitting? (Yes, I think so.) It's
> less than the zk count at any one time. How are they tied? Dunno.
> This issue is about improving visibility in our distributed wal-splitter. In
> particular, the zk-based splitter's visibility is poor (HBASE-21588 is a new
> splitter built on procedures that probably has a similar issue). We need to
> have a count of overall outstanding tasks, of how long tasks are taking, and
> of how well distributed the job is (see below).
> An aside was that I had the default for {code}HBASE_SPLIT_WAL_MAX_SPLITTER =
> "hbase.regionserver.wal.max.splitters"{code}, which is {code}
> public static final int DEFAULT_HBASE_SPLIT_WAL_MAX_SPLITTER = 2; {code} so,
> in extremis, I upped the count to 10 but even then, looking at random RSs,
> they were only getting work on occasion, sometimes with hours passing between
> split tasks, as though the Master was bad at handing out the jobs: hard to
> tell when there is no view on what is going on.
> Below is from the log of a random RS... notice how sporadically it gets
> split work even though there are thousands of WALs to work on:
> {code}
> 89 2020-06-11 04:38:40,113 INFO
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler:
> Worker ps1406.example.org,16020,1591849310794 done with task
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@271eba8d
> in 12309ms. Status = DONE
> 90 2020-06-11 04:38:51,445 INFO
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler:
> Worker ps1406.example.org,16020,1591849310794 done with task
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6c9e0b6
> in 13230ms. Status = DONE
> 91 2020-06-11 04:38:57,060 INFO
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler:
> Worker ps1406.example.org,16020,1591849310794 done with task
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@7f102aeb
> in 16759ms. Status = DONE
> 92 2020-06-11 09:17:12,762 INFO
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler:
> Worker ps1406.example.org,16020,1591849310794 done with task
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@28c31f7d
> in 1181ms. Status = DONE
> 93 2020-06-11 09:17:14,512 INFO
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler:
> Worker ps1406.example.org,16020,1591849310794 done with task
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@3ecd4959
> in 2224ms. Status = DONE
> 94 2020-06-11 09:17:14,750 INFO
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler:
> Worker ps1406.example.org,16020,1591849310794 done with task
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@5a72b3ad
> in 1345ms. Status = DONE
> 95 2020-06-11 09:17:19,898 INFO
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler:
> Worker ps1406.example.org,16020,1591849310794 done with task
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@6af0858d
> in 5319ms. Status = DONE
> 96 2020-06-11 09:17:21,558 INFO
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler:
> Worker ps1406.example.org,16020,1591849310794 done with task
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@1bca0d8b
> in 890ms. Status = DONE
> 97 2020-06-11 16:12:31,676 INFO
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler:
> Worker ps1406.example.org,16020,1591849310794 done with task
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@50a124bc
> in 1193ms. Status = DONE
> 98 2020-06-11 16:12:34,140 INFO
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler:
> Worker ps1406.example.org,16020,1591849310794 done with task
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@519c681c
> in 2383ms. Status = DONE
> 99 2020-06-11 16:12:35,943 INFO
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler:
> Worker ps1406.example.org,16020,1591849310794 done with task
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@4a47b062
> in 2278ms. Status = DONE
> 100 2020-06-11 16:12:36,437 INFO
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-0] handler.WALSplitterHandler:
> Worker ps1406.example.org,16020,1591849310794 done with task
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@f9bcea
> in 1497ms. Status = DONE
> 101 2020-06-11 16:12:39,953 INFO
> [RS_LOG_REPLAY_OPS-regionserver/ps1406:16020-1] handler.WALSplitterHandler:
> Worker ps1406.example.org,16020,1591849310794 done with task
> org.apache.hadoop.hbase.coordination.ZkSplitLogWorkerCoordination$ZkSplitTaskDetails@42d82ac4
> in 914ms. Status = DONE
> {code}
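> Absent proper metrics, a rough way to see how long split tasks on a given RS
> are taking is to scrape the times out of log lines like the above; an
> untested sketch, assuming the 'in <N>ms' format shown and a log file named
> regionserver.log (hypothetical name):
> {code}$ grep 'WALSplitterHandler: Worker' regionserver.log | sed -e 's/.* in \([0-9]*\)ms.*/\1/' | sort -n | tail{code}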