[jira] [Commented] (HBASE-22978) Online slow response log

Andrew Purtell (Jira) Thu, 05 Sep 2019 14:52:51 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16923760#comment-16923760
 ]


Andrew Purtell commented on HBASE-22978:
----------------------------------------

I'll update the description to include write-behind of the ring buffer to a 
directory in HDFS. The write-behind shouldn't block, so if we stall while 
writing, some ring buffer entries may be lost. If we can detect that, we can 
record in the file that it happened. 
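
A minimal sketch of the non-blocking write-behind idea, assuming a fixed-size 
ring indexed by sequence number and a background writer that periodically 
drains it. The class name WriteBehindRing, the String entry type, and the 
"# lost N entries" marker line are hypothetical and not part of the proposal; 
the actual HDFS write is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical ring buffer with a write-behind cursor (illustration only). */
final class WriteBehindRing {
  private final String[] slots;
  private long nextSeq = 0;     // sequence number the next added entry will get
  private long flushedSeq = 0;  // sequence number of the next entry to flush

  WriteBehindRing(int capacity) {
    slots = new String[capacity];
  }

  /** Never waits on I/O: when the ring is full the oldest slot is overwritten. */
  synchronized void add(String entry) {
    slots[(int) (nextSeq % slots.length)] = entry;
    nextSeq++;
  }

  /**
   * Called by the background writer. Returns the lines to append to the file.
   * If the writer fell behind and unflushed entries were overwritten, the
   * first line is a marker recording how many were lost. The caller performs
   * the (possibly slow) file write after this method returns, so add() is
   * never held up by that write.
   */
  synchronized List<String> drain() {
    List<String> out = new ArrayList<>();
    long oldest = Math.max(0, nextSeq - slots.length); // oldest entry still in the ring
    if (flushedSeq < oldest) {
      out.add("# lost " + (oldest - flushedSeq) + " entries");
      flushedSeq = oldest;
    }
    while (flushedSeq < nextSeq) {
      out.add(slots[(int) (flushedSeq % slots.length)]);
      flushedSeq++;
    }
    return out;
  }
}
```

With a capacity of 3, adding "a" and "b", draining, then adding five more 
entries before the next drain yields a marker line for the two overwritten 
entries followed by the three that survived.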

bq. Also saving the user and/or client IP of the request and being able to ask 
for requests by those would be extra nice

Ok, will include user and client IP in the request details set aside. 

As for query APIs, the admin API is sugar over fan-out requests to 
regionservers for whatever is currently sitting in the ring buffers. Where we 
want to narrow the search by region or table, we can get region locations and 
prune the regionserver set. Filtering or sorting on other attributes would be 
done locally in the client. I think it best to let the client index the list of 
ResponseDetail however it likes. 
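
A rough sketch of the two client-side steps described above: pruning the 
regionserver set from region locations, then filtering and sorting the 
returned entries locally. SlowEntry is a hypothetical stand-in for 
ResponseDetail, whose fields are not defined in this thread; the helper names 
and the region-to-server map are likewise assumptions for illustration.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

/** Hypothetical stand-in for ResponseDetail (field set assumed). */
final class SlowEntry {
  final String server;
  final String region;
  final String user;
  final String clientIp;
  final long processingTimeMs;

  SlowEntry(String server, String region, String user, String clientIp,
      long processingTimeMs) {
    this.server = server;
    this.region = region;
    this.user = user;
    this.clientIp = clientIp;
    this.processingTimeMs = processingTimeMs;
  }
}

final class SlowResponseClient {
  /** Prune the fan-out set: only servers hosting a region of interest are asked. */
  static Set<String> serversFor(Map<String, String> regionToServer,
      List<String> regions) {
    Set<String> servers = new TreeSet<>();
    for (String region : regions) {
      String server = regionToServer.get(region);
      if (server != null) {
        servers.add(server);
      }
    }
    return servers;
  }

  /** Local filtering and sorting: entries for one user, slowest first. */
  static List<SlowEntry> byUser(List<SlowEntry> entries, String user) {
    return entries.stream()
        .filter(e -> e.user.equals(user))
        .sorted(Comparator.comparingLong((SlowEntry e) -> e.processingTimeMs).reversed())
        .collect(Collectors.toList());
  }
}
```

Because the regionservers only return what is in their ring buffers, any other 
index (by client IP, by region, by time window) can be built the same way on 
the client without changing the server-side API.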

The shell commands are one client of the admin APIs. This seems a good place to 
put additional convenience filtering. Will update the description for this too. 

> Online slow response log
> ------------------------
>
>                 Key: HBASE-22978
>                 URL: https://issues.apache.org/jira/browse/HBASE-22978
>             Project: HBase
>          Issue Type: New Feature
>          Components: Admin, regionserver, shell
>            Reporter: Andrew Purtell
>            Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a 
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer - 
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value: 
> \"tsdb,\\000\\000\\215\\f)o\\\\\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000<TRUNCATED>",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above 
> example. We do this because the human readable representation is verbose, the 
> rate of too slow warnings may be high, and the combination of these things 
> can overwhelm the log capture system. The truncation is unfortunate because 
> it eliminates much of the utility of the warnings. For example, the region 
> name, the start and end keys, and the filter hierarchy are all important 
> clues for debugging performance problems caused by moderate to low 
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be 
> too slow in addition to the responseTooSlow logging. The in-memory 
> representation can be complete and compressed. A new admin API and shell 
> command can provide access to the ring buffer for online performance 
> debugging. A modest sizing of the ring buffer will prevent excessive memory 
> utilization for a minor performance debugging feature by limiting the total 
> number of retained records. There is some chance a high rate of requests will 
> cause information on other interesting requests to be overwritten before it 
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted 
> in the request. We don't need to retain all key-values in the mutation, which 
> may be too large to comfortably retain. We only need a unique set of row 
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to 
> apply fast compression to ring buffer entries (if codec support is 
> available), something like snappy or zstandard, and decompress on the fly 
> when servicing the retrieval API request. This will minimize the impact of 
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same 
> information provided by responseTooSlow warnings. Total size of response 
> serialization, possibly also total cell or row counts, should be sufficient 
> to characterize the response.
> —
> New shell commands:
> {{get_slow_responses <tableOrRegion> [ , \{ SERVERS=><server_list> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer. Provide a table name as first argument to find all regions and 
> retrieve too slow response entries for the given table from all servers 
> currently hosting it. Provide a region name as first argument to retrieve all 
> too slow response entries for the given region. Optionally provide a map of 
> parameters as second argument. The SERVERS parameter, which expects a list of 
> server names, will constrain the search to the given set of servers. A server 
> name is its host, port, and start code, e.g. 
> "host187.example.com,60020,1289493121758".
> {{get_slow_responses [ <server1> ... , <serverN> ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring 
> buffer maintained by the given list of servers; or all servers on the cluster 
> if no argument is provided. A server name is its host, port, and start code, 
> e.g. "host187.example.com,60020,1289493121758".
> {{clear_slow_responses [ <server1> ... , <serverN> ]}}
> Clear the too slow response ring buffer maintained by the given list of 
> servers; or all servers on the cluster if no argument is provided. A server 
> name is its host, port, and start code, e.g. 
> "host187.example.com,60020,1289493121758".
> —
> New Admin APIs:
> {code:java}
> List<ResponseDetail> Admin#getSlowResponses(String tableOrRegion, @Nullable 
> List<String> servers);
> {code}
> {code:java}
> List<ResponseDetail> Admin#getSlowResponses(@Nullable List<String> servers);
> {code}
> {code:java}
> List<ResponseDetail> Admin#clearSlowResponses(@Nullable List<String> servers);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
