[
https://issues.apache.org/jira/browse/HBASE-22978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932156#comment-16932156
]
Viraj Jasani commented on HBASE-22978:
--------------------------------------
In this case, is it fine to have "optionally dumping logs to HDFS" as a
sub-task to this?
> Online slow response log
> ------------------------
>
> Key: HBASE-22978
> URL: https://issues.apache.org/jira/browse/HBASE-22978
> Project: HBase
> Issue Type: New Feature
> Components: Admin, Operability, regionserver, shell
> Reporter: Andrew Purtell
> Priority: Minor
>
> Today when an individual RPC exceeds a configurable time bound we log a
> complaint by way of the logging subsystem. These log lines look like:
> {noformat}
> 2019-08-30 22:10:36,195 WARN [,queue=15,port=60020] ipc.RpcServer -
> (responseTooSlow):
> {"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ScanRequest)",
> "starttimems":1567203007549,
> "responsesize":6819737,
> "method":"Scan",
> "param":"region { type: REGION_NAME value:
> \"tsdb,\\000\\000\\215\\f)o\\\\\\024\\302\\220\\000\\000\\000\\000\\000\\001\\000\\000\\000\\000\\000\\006\\000\\000\\000\\000\\000\\005\\000\\000<TRUNCATED>",
> "processingtimems":28646,
> "client":"10.253.196.215:41116",
> "queuetimems":22453,
> "class":"HRegionServer"}
> {noformat}
> Unfortunately we often truncate the request parameters, like in the above
> example. We do this because the human readable representation is verbose, the
> rate of too slow warnings may be high, and the combination of these things
> can overwhelm the log capture system. The truncation is unfortunate because
> it eliminates much of the utility of the warnings. For example, the region
> name, the start and end keys, and the filter hierarchy are all important
> clues for debugging performance problems caused by moderate to low
> selectivity queries or queries made at a high rate.
> We can maintain an in-memory ring buffer of requests that were judged to be
> too slow in addition to the responseTooSlow logging. The in-memory
> representation can be complete and compressed. A new admin API and shell
> command can provide access to the ring buffer for online performance
> debugging. A modest sizing of the ring buffer will prevent excessive memory
> utilization for a minor performance debugging feature by limiting the total
> number of retained records. There is some chance a high rate of requests will
> cause information on other interesting requests to be overwritten before it
> can be read. This is the nature of a ring buffer and an acceptable trade off.
> The write request types do not require us to retain all information submitted
> in the request. We don't need to retain all key-values in the mutation, which
> may be too large to comfortably retain. We only need a unique set of row
> keys, or even a min/max range, and total counts.
> The consumers of this information will be debugging tools. We can afford to
> apply fast compression to ring buffer entries (if codec support is
> available), something like snappy or zstandard, and decompress on the fly
> when servicing the retrieval API request. This will minimize the impact of
> retaining more information about slow requests than we do today.
> This proposal is for retention of request information only, the same
> information provided by responseTooSlow warnings. Total size of response
> serialization, possibly also total cell or row counts, should be sufficient
> to characterize the response.
> Optionally persist new entries added to the ring buffer into one or more
> files in HDFS in a write-behind manner. If the HDFS writer blocks or falls
> behind and we are unable to persist an entry before it is overwritten, that
> is fine. Response too slow logging is best effort. If we can detect this make
> a note of it in the log file. Provide a tool for parsing, dumping, filtering,
> and pretty printing the slow logs written to HDFS. The tool and the shell can
> share and reuse some utility classes and methods for accomplishing that.
> —
> New shell commands:
> {{get_slow_responses [ <server1> ... , <serverN> ] [ , \{ <filter-attributes>
> } ]}}
> Retrieve, decode, and pretty print the contents of the too slow response ring
> buffer maintained by the given list of servers; or all servers in the cluster
> if no list is provided. Optionally provide a map of parameters for filtering
> as additional argument. The TABLE filter, which expects a string containing a
> table name, will include only entries pertaining to that table. The REGION
> filter, which expects a string containing an encoded region name, will
> include only entries pertaining to that region. The CLIENT_IP filter, which
> expects a string containing an IP address, will include only entries
> pertaining to that client. The USER filter, which expects a string containing
> a user name, will include only entries pertaining to that user. Filters are
> additive, for example if both CLIENT_IP and USER filters are given, entries
> matching either or both conditions will be included. The exception to this is
> if both TABLE and REGION filters are provided, REGION will be preferred and
> TABLE will be ignored. A server name is its host, port, and start code, e.g.
> "host187.example.com,60020,1289493121758".
> {{clear_slow_responses [ <server1> ... , <serverN> ]}}
> Clear the too slow response ring buffer maintained by the given list of
> servers; or all servers on the cluster if no argument is provided. A server
> name is its host, port, and start code, e.g.
> "host187.example.com,60020,1289493121758".
> —
> New Admin APIs:
> {code:java}
> List<ResponseDetail> Admin#getSlowResponses(@Nullable List<String> servers);
> {code}
> {code:java}
> List<ResponseDetail> Admin#clearSlowResponses(@Nullable List<String> servers);
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)