[ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923737#comment-16923737
 ] 

Sean Busbey commented on HBASE-22978:
-------------------------------------

Can we include the ability to have entries from the buffer(s) dumped to a 
FileSystem directory? I'd like to have something flexible available for when I 
need to use this with an outlier case.


Also, saving the user and/or client IP of the request, and being able to ask 
for requests by those, would be extra nice.

> Online slow response log
> ------------------------
>
>                 Key: HBASE-22978
>                 URL: https://issues.apache.org/jira/browse/HBASE-22978
>             Project: HBase
>          Issue Type: New Feature
>          Components: Admin, regionserver, shell
>            Reporter: Andrew Purtell
>            Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\\\\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000<TRUNCATED>",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade-off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> —
> New shell commands:
> {{get_slow_responses <tableOrRegion> [ , \{ SERVERS=><server_list> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer. Provide a table name as first argument to find all regions and 
> retrieve too slow response entries for the given table from all servers 
> currently hosting it. Provide a region name as first argument to retrieve all 
> too slow response entries for the given region. Optionally provide a map of 
> parameters as second argument. The SERVERS parameter, which expects a list of 
> server names, will constrain the search to the given set of servers. A server 
> name is its host, port, and start code, e.g. 
> "host187.example.com,60020,1289493121758".
> {{get_slow_responses [ <server1> ... , <serverN> ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers on the cluster 
> if no argument is provided. A server name is its host, port, and start code, 
> e.g. "host187.example.com,60020,1289493121758".
> {{clear_slow_responses [ <server1> ... , <serverN> ]}}
> Clear the too slow response ring buffer maintained by the given list of 
> servers; or all servers on the cluster if no argument is provided. A server 
> name is its host, port, and start code, e.g. 
> "host187.example.com,60020,1289493121758".
> —
> New Admin APIs:
> {code:java}
> List<ResponseDetail> Admin#getSlowResponses(String tableOrRegion, @Nullable 
> List<String> servers);
> {code}
> {code:java}
> List<ResponseDetail> Admin#getSlowResponses(@Nullable List<String> servers);
> {code}
> {code:java}
> List<ResponseDetail> Admin#clearSlowResponses(@Nullable List<String> servers);
> {code}
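
For readers who want a concrete picture of the ring buffer behavior described 
in the quoted proposal, here is a minimal sketch in plain Java. It is not HBase 
code: the class name SlowResponseRing, the Entry record and its fields 
(method, client, processingTimeMs), and the choice to have clear() return the 
drained entries are all assumptions made for illustration. It only shows the 
bounded-memory, overwrite-oldest behavior; compression and the RPC plumbing 
are left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch (not HBase code): fixed-capacity ring buffer of slow-request records. */
final class SlowResponseRing {

  /** Hypothetical record; the fields loosely mirror a responseTooSlow log line. */
  static final class Entry {
    public final String method;
    public final String client;
    public final long processingTimeMs;

    public Entry(String method, String client, long processingTimeMs) {
      this.method = method;
      this.client = client;
      this.processingTimeMs = processingTimeMs;
    }
  }

  private final Entry[] slots;
  private long written; // total number of entries added since the last clear

  public SlowResponseRing(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.slots = new Entry[capacity];
  }

  /** Adds an entry, overwriting the oldest one once the buffer is full. */
  public synchronized void add(Entry e) {
    slots[(int) (written % slots.length)] = e;
    written++;
  }

  /** Returns the retained entries, oldest first, without modifying the buffer. */
  public synchronized List<Entry> snapshot() {
    int retained = (int) Math.min(written, slots.length);
    List<Entry> out = new ArrayList<>(retained);
    for (long i = written - retained; i < written; i++) {
      out.add(slots[(int) (i % slots.length)]);
    }
    return out;
  }

  /** Empties the buffer and returns what it held (an assumed contract, for illustration). */
  public synchronized List<Entry> clear() {
    List<Entry> drained = snapshot();
    Arrays.fill(slots, null);
    written = 0;
    return drained;
  }
}
```

With a capacity of 2, adding three entries leaves only the last two in 
snapshot(), which is the "overwritten before it can be read" trade-off the 
proposal accepts. Filtering by user or client IP, as requested in the comment 
above, could be layered on top of snapshot() by the caller.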



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
