[
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923737#comment-16923737
]
Sean Busbey commented on HBASE-22978:
-------------------------------------
Can we include the ability to have entries from the buffer(s) dumped to a
FileSystem directory? I'd like to have something flexible available for when I
need to use this with an outlier case.
Also, saving the user and/or client IP of the request, and being able to ask
for requests by either, would be extra nice.
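A rough sketch of what the two asks above could look like, only to make them concrete. Everything here is hypothetical: the SlowResponseDump class, its Entry type, and the filter and dump methods are illustrative names, not existing HBase code. It also uses java.nio.file as a stand-in for a Hadoop FileSystem so that it runs without extra dependencies.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names exist in HBase. */
class SlowResponseDump {

  /** Minimal stand-in for one buffered slow-response entry. */
  static final class Entry {
    final String user;
    final String clientAddress; // "ip:port", like the "client" field of responseTooSlow
    final String detail;        // whatever else the buffer retains about the request

    Entry(String user, String clientAddress, String detail) {
      this.user = user;
      this.clientAddress = clientAddress;
      this.detail = detail;
    }

    /** Strips the port, if any, from the client address. */
    String clientIp() {
      int colon = clientAddress.lastIndexOf(':');
      return colon < 0 ? clientAddress : clientAddress.substring(0, colon);
    }
  }

  /** Returns entries matching the user and/or client IP; a null filter matches everything. */
  static List<Entry> filter(List<Entry> entries, String user, String clientIp) {
    List<Entry> out = new ArrayList<>();
    for (Entry e : entries) {
      boolean userOk = user == null || user.equals(e.user);
      boolean ipOk = clientIp == null || clientIp.equals(e.clientIp());
      if (userOk && ipOk) {
        out.add(e);
      }
    }
    return out;
  }

  /** Writes one tab-separated line per entry to dir/fileName and returns the file path. */
  static Path dump(List<Entry> entries, Path dir, String fileName) {
    List<String> lines = new ArrayList<>();
    for (Entry e : entries) {
      lines.add(e.user + "\t" + e.clientAddress + "\t" + e.detail);
    }
    try {
      Files.createDirectories(dir);
      return Files.write(dir.resolve(fileName), lines, StandardCharsets.UTF_8);
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }
}
```

A caller could combine the two, e.g. dump(filter(entries, null, "10.253.196.215"), dir, "slow.tsv"), to keep only one client's requests.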
> Online slow response log
> ------------------------
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
> Issue Type: New Feature
> Components: Admin, regionserver, shell
> Reporter: Andrew Purtell
> Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer -
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value:
> \"tsdb,\\000\\000\\215\\f)o\\\\\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000<TRUNCATED>",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above
> example. We do this because the human readable representation is verbose, the
> rate of too slow warnings may be high, and the combination of these things
> can overwhelm the log capture system. The truncation is unfortunate because
> it eliminates much of the utility of the warnings. For example, the region
> name, the start and end keys, and the filter hierarchy are all important
> clues for debugging performance problems caused by moderate to low
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be
> too slow in addition to the responseTooSlow logging. The in-memory
> representation can be complete and compressed. A new admin API and shell
> command can provide access to the ring buffer for online performance
> debugging. A modest sizing of the ring buffer will prevent excessive memory
> utilization for a minor performance debugging feature by limiting the total
> number of retained records. There is some chance a high rate of requests will
> cause information on other interesting requests to be overwritten before it
> can be read. This is the nature of a ring buffer and an acceptable trade-off.
> The write request types do not require us to retain all information submitted
> in the request. We don't need to retain all key-values in the mutation, which
> may be too large to comfortably retain. We only need a unique set of row
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to
> apply fast compression to ring buffer entries (if codec support is
> available), something like snappy or zstandard, and decompress on the fly
> when servicing the retrieval API request. This will minimize the impact of
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same
> information provided by responseTooSlow warnings. Total size of response
> serialization, possibly also total cell or row counts, should be sufficient
> to characterize the response.
> —
> New shell commands:
> {{get_slow_responses <tableOrRegion> [ , \{ SERVERS=><server_list> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring
> buffer. Provide a table name as first argument to find all regions and
> retrieve too slow response entries for the given table from all servers
> currently hosting it. Provide a region name as first argument to retrieve all
> too slow response entries for the given region. Optionally provide a map of
> parameters as second argument. The SERVERS parameter, which expects a list of
> server names, will constrain the search to the given set of servers. A server
> name is its host, port, and start code, e.g.
> "host187.example.com,60020,1289493121758".
> {{get_slow_responses [ <server1> ... , <serverN> ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring
> buffer maintained by the given list of servers; or all servers on the cluster
> if no argument is provided. A server name is its host, port, and start code,
> e.g. "host187.example.com,60020,1289493121758".
> {{clear_slow_responses [ <server1> ... , <serverN> ]}}
> Clear the too slow response ring buffer maintained by the given list of
> servers; or all servers on the cluster if no argument is provided. A server
> name is its host, port, and start code, e.g.
> "host187.example.com,60020,1289493121758".
> —
> New Admin APIs:
> {code:java}
> List<ResponseDetail> Admin#getSlowResponses(String tableOrRegion, @Nullable
> List<String> servers);
> {code}
> {code:java}
> List<ResponseDetail> Admin#getSlowResponses(@Nullable List<String> servers);
> {code}
> {code:java}
> List<ResponseDetail> Admin#clearSlowResponses(@Nullable List<String> servers);
> {code}
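For illustration, the bounded, compressed ring buffer described in the quoted proposal could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the proposed implementation: SlowResponseRingBuffer and its methods are hypothetical names, entries are plain strings rather than ResponseDetail objects, and java.util.zip's Deflater at BEST_SPEED stands in for snappy or zstandard so that the example needs no native codec.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch of a fixed-size, compressed ring buffer; not HBase code. */
class SlowResponseRingBuffer {
  private final byte[][] slots; // compressed entries
  private int next;             // index of the slot that will be written next
  private int size;             // number of live entries, at most slots.length

  SlowResponseRingBuffer(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    slots = new byte[capacity][];
  }

  /** Compresses and stores one entry, overwriting the oldest entry when full. */
  synchronized void add(String detail) {
    slots[next] = compress(detail.getBytes(StandardCharsets.UTF_8));
    next = (next + 1) % slots.length;
    if (size < slots.length) {
      size++;
    }
  }

  /** Decompresses on the fly and returns the retained entries, oldest first. */
  synchronized List<String> snapshot() {
    List<String> out = new ArrayList<>(size);
    int start = (next - size + slots.length) % slots.length;
    for (int i = 0; i < size; i++) {
      byte[] packed = slots[(start + i) % slots.length];
      out.add(new String(decompress(packed), StandardCharsets.UTF_8));
    }
    return out;
  }

  /** Drops all retained entries. */
  synchronized void clear() {
    Arrays.fill(slots, null);
    next = 0;
    size = 0;
  }

  private static byte[] compress(byte[] raw) {
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(raw);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }

  private static byte[] decompress(byte[] packed) {
    Inflater inflater = new Inflater();
    inflater.setInput(packed);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    try {
      while (!inflater.finished()) {
        int n = inflater.inflate(buf);
        out.write(buf, 0, n);
        if (n == 0 && !inflater.finished()) {
          // No progress is possible: the stored bytes are truncated or corrupt.
          throw new IllegalStateException("truncated or corrupt entry");
        }
      }
    } catch (DataFormatException e) {
      throw new IllegalStateException("corrupt entry", e);
    } finally {
      inflater.end();
    }
    return out.toByteArray();
  }
}
```

With a capacity of 2, adding "a", "b", and "c" leaves "b" and "c" in the snapshot: the oldest entry is overwritten, which is the trade-off the proposal accepts in exchange for bounded memory use.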