Andrew Purtell created HBASE-22978:
--------------------------------------
Summary: Online slow response log
Key: HBASE-22978
URL: https://issues.apache.org/jira/browse/HBASE-22978
Project: HBase
Issue Type: New Feature
Components: Admin, regionserver, shell
Reporter: Andrew Purtell
Today when an individual RPC exceeds a configurable time bound we log a
complaint by way of the logging subsystem. These log lines look like:
{noformat}
2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer -
(responseTooSlow):
{"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)","starttimems":1567203007549,"responsesize":6819737,"method":"Scan","param":"region
{ type: REGION_NAME value:
\"tsdb,\\000\\000\\215\\f)o\\\\\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000<TRUNCATED>","processingtimems":28646,"client":"10.253.196.215:41116","queuetimems":22453,"class":"HRegionServer"}
{noformat}
Unfortunately we often truncate the request parameters, like in the above
example. We do this because the human readable representation is verbose, the
rate of too slow warnings may be high, and the combination of these things can
overwhelm the log capture system. (All of these things have been reported from
various production settings. Truncation was added after we crashed a user's log
capture system.) The truncation is unfortunate because it eliminates much of
the utility of the warnings. For example, the region name, the start and end
keys, and the filter hierarchy are all important clues for debugging
performance problems caused by moderate to low selectivity queries or queries
made at a high rate.
We can maintain an in-memory ring buffer of requests that were judged to be too
slow in addition to the responseTooSlow logging. The in-memory representation
can be complete and compressed. A new admin API and shell command can provide
access to the ring buffer for online performance debugging. A modest sizing of
the ring buffer will prevent excessive memory utilization for a minor
performance debugging feature by limiting the total number of retained records.
There is some chance a high rate of requests will cause information on other
interesting requests to be overwritten before it can be read. This is the
nature of a ring buffer and an acceptable trade off.
The write request types do not require us to retain all information submitted
in the request. We don't need to retain all key-values in the mutation, which
may be too large to comfortably retain. We only need a unique set of row keys,
or even a min/max range, and total counts.
The consumers of this information will be debugging tools. We can afford to
apply fast compression to ring buffer entries (if codec support is available),
something like snappy or zstandard, and decompress on the fly when servicing
the retrieval API request. This will minimize the impact of retaining more
information about slow requests than we do today.
This proposal is for retention of request information only, the same
information provided by responseTooSlow warnings. Total size of response
serialization, possibly also total cell or row counts, should be sufficient to
characterize the response.
—
New shell commands:
{{get_slow_responses <tableOrRegion> [ , \{ SERVERS=><server_list> } ]}}
Retrieve, decode, and pretty print the contents of the too slow response ring
buffer. Provide a table name as first argument to find all regions and retrieve
too slow response entries for the given table from all servers currently
hosting it. Provide a region name as first argument to retrieve all too slow
response entries for the given region. Optionally provide a map of parameters
as second argument. The SERVERS parameter, which expects a list of server
names, will constrain the search to the given set of servers. A server name is
its host, port, and start code, e.g. "host187.example.com,60020,1289493121758".
{{get_slow_responses [ <server1> ... , <serverN> ]}}
Retrieve, decode, and pretty print the contents of the too slow response ring
buffer maintained by the given list of servers; or all servers on the cluster
if no argument is provided. A server name is its host, port, and start code,
e.g. "host187.example.com,60020,1289493121758".
{{clear_slow_responses [ <server1> ... , <serverN> ]}}
Clear the too slow response ring buffer maintained by the given list of
servers; or all servers on the cluster if no argument is provided. A server
name is its host, port, and start code, e.g.
"host187.example.com,60020,1289493121758".
—
New Admin APIs:
{code:java}
List<ResponseDetail> Admin#getSlowResponses(String tableOrRegion, @Nullable
List<String> servers);
{code}
{code:java}
List<ResponseDetail> Admin#getSlowResponses(@Nullable List<String> servers);
{code}
{code:java}
List<ResponseDetail> Admin#clearSlowResponses(@Nullable List<String> servers);
{code}
--
This message was sent by Atlassian Jira
(v8.3.2#803003)