Vikas Vishwakarma created HBASE-19236:
-----------------------------------------
Summary: Tune client backoff trigger logic and backoff time in
ExponentialClientBackoffPolicy
Key: HBASE-19236
URL: https://issues.apache.org/jira/browse/HBASE-19236
Project: HBase
Issue Type: Improvement
Reporter: Vikas Vishwakarma
We were evaluating the ExponentialClientBackoffPolicy (HBASE-12986) for
implementing basic service protection and usage quota allocation for a few
heavy-loading clients, especially M/R-job-based HBase clients. However, we
observed that ExponentialClientBackoffPolicy slows the client down dramatically
even when there is not much load on the HBase cluster.
A simple multithreaded write-throughput client running on a 40-node cluster
(~100G of data) completed in less than 5 minutes without
ExponentialClientBackoffPolicy enabled.
The same client took ~10 hours to complete with ExponentialClientBackoffPolicy
enabled at its default settings (DEFAULT_MAX_BACKOFF of 5 minutes).
Even after reducing DEFAULT_MAX_BACKOFF to 1 minute, the client took ~2 hours
to complete.
The current ExponentialClientBackoffPolicy decides the backoff time based on 3
factors:
// Factor in memstore load
double percent = regionStats.getMemstoreLoadPercent() / 100.0;
// Factor in heap occupancy
float heapOccupancy = regionStats.getHeapOccupancyPercent() / 100.0f;
// Factor in compaction pressure, 1.0 means heavy compaction pressure
float compactionPressure = regionStats.getCompactionPressure() / 100.0f;
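To make the discussion concrete, here is a simplified, stand-alone sketch of how an exponential backoff time could be derived from those three factors. This is illustrative only, not the actual HBase implementation: the class name, the max-of-factors combination, and the exponential scaling constant are all assumptions for the example.

```java
// Simplified sketch of an exponential client backoff computation.
// NOT the actual HBase ExponentialClientBackoffPolicy code; the scaling
// curve and the max-of-factors combination are illustrative assumptions.
public class BackoffSketch {
    // Corresponds to the DEFAULT_MAX_BACKOFF of 5 mins mentioned above.
    static final long MAX_BACKOFF_MS = 5 * 60 * 1000;

    // Each factor is normalized to [0, 1]. The worst (highest) factor is
    // mapped onto an exponential curve, so low load yields near-zero backoff
    // and full load yields MAX_BACKOFF_MS.
    static long backoffTime(double memstoreLoad, double heapOccupancy,
                            double compactionPressure) {
        double worst = Math.max(memstoreLoad,
                       Math.max(heapOccupancy, compactionPressure));
        // (e^(4*worst) - 1) / (e^4 - 1) maps [0,1] -> [0,1] exponentially.
        double multiplier = (Math.exp(4 * worst) - 1) / (Math.exp(4) - 1);
        return (long) (multiplier * MAX_BACKOFF_MS);
    }

    public static void main(String[] args) {
        System.out.println(backoffTime(0.0, 0.0, 0.0)); // idle cluster
        System.out.println(backoffTime(1.0, 0.2, 0.1)); // memstore full
    }
}
```

Note that with this shape of policy, a single high factor (e.g. a full memstore) drives the backoff all the way up regardless of the other two, which matches the behavior we observed under load.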
However, our test observations suggest that the client backoff is being
triggered even when there is hardly any load on the cluster. We need to
re-evaluate the existing logic, or possibly implement a different policy more
customized and suitable to our needs.
One idea is to base the backoff directly on compactionQueueLength instead of
heap occupancy etc. Consider a case where there is a high-throughput write load
and compaction is still able to keep up with the rate of memstore flushes,
compacting all the flushed files at the same rate. In this case the memstore
can be full and heap occupancy can be high, yet that is not necessarily an
indicator that the service is falling behind on processing the client load and
that the client needs to back off; we are just utilizing the full write
throughput of the system, which is good. However, if the compaction queue
starts building up, staying continuously above a threshold and increasing,
that is a reliable indicator that the system cannot keep up with the input
load and is slowly falling behind.
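The proposed trigger could be sketched roughly as follows. Everything here is hypothetical (class name, threshold value, and the "3 consecutive growing samples" rule are placeholders for whatever tuning we settle on); the point is that backoff fires only when the compaction queue is both above a threshold and growing, not merely when memstore/heap occupancy is high.

```java
// Illustrative sketch of a compaction-queue-based backoff trigger.
// All names and thresholds are hypothetical, not existing HBase APIs.
public class CompactionQueueBackoff {
    private final int queueThreshold;
    private int lastQueueLength = 0;
    private int consecutiveGrowth = 0;

    public CompactionQueueBackoff(int queueThreshold) {
        this.queueThreshold = queueThreshold;
    }

    // Returns true only when the compaction queue has been above the
    // threshold AND strictly increasing for 3 consecutive samples,
    // i.e. compactions are genuinely falling behind the input load.
    public boolean shouldBackoff(int compactionQueueLength) {
        boolean growing = compactionQueueLength > lastQueueLength
                && compactionQueueLength > queueThreshold;
        consecutiveGrowth = growing ? consecutiveGrowth + 1 : 0;
        lastQueueLength = compactionQueueLength;
        return consecutiveGrowth >= 3;
    }
}
```

With this rule, a steady-state cluster running at full write throughput (queue high but flat, or low) never triggers backoff; only a sustained upward trend does.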
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)