[
https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815493#comment-13815493
]
stack commented on HBASE-9775:
------------------------------
Back to the root discussion on this issue:
bq. with a max.total.tasks of 100 and max.perserver.tasks of 5, the client
might not use all the server. May be a default of 2 for max.perserver.tasks
would be better
That'll work if there are many servers, but it will be a constraint when there
are only a few servers and a few clients. In that case we will schedule at most
two tasks to each server when it could take many more.
Ideally we want something like what you had before -- 5 or 1/2 the CPUs on the
local server as a guesstimate of how many CPUs the server has, whichever is
greater -- and then, as soon as we get indications that the server is
struggling, go down from this max per server and slowly ramp back up as we have
successful ops against said server (how drastic the drop in tasks-per-server
should be would depend on the exception we got from the server).
bq. the server reject the client when it's busy (HBASE-9467). That increases
the number of retries to do, and, on an heavy load, can lead us to fail on
something that would have worked before.
We only reject as 'busy' when we can't obtain the lock within a certain amount
of time and when we are trying to flush because we are up against the global mem
limit.
Regards retries, if we get one of these RegionTooBusyExceptions, rather than
back off for 100ms or so, should we back off more (an Elliott suggestion)? And
drop the number of tasks to throw at this server at any one time? It'd be hard
to do as things stand, given backoff is calculated off the retry count only.
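A sketch of what exception-aware backoff could look like, versus the current retry-count-only calculation. The multiplier ladder echoes the hardcoded retry backoff table in the 0.96 client, but the exact values and the 10x busy penalty here are made-up illustrations:

```python
# Illustrative multiplier ladder (the real client has a similar hardcoded
# table); PAUSE_MS mirrors a ~100ms base client pause. Assumptions only.
RETRY_BACKOFF = [1, 2, 3, 5, 10, 20, 40, 100]
PAUSE_MS = 100

def backoff_ms(retry_count, region_too_busy=False):
    """Backoff that looks at what the server threw, not just the retry count.

    The 10x penalty for a RegionTooBusyException is a hypothetical knob, not
    anything the client actually does today.
    """
    mult = RETRY_BACKOFF[min(retry_count, len(RETRY_BACKOFF) - 1)]
    base = PAUSE_MS * mult
    # Back off much harder when the server explicitly said it was too busy.
    return base * 10 if region_too_busy else base
```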
Given the two items above, should we keep more stats per server than just a
count of tasks? Should we keep a history of success/error and do backoffs --
both the amount of time and how many tasks to send the server -- based on this?
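The per-server history could be as simple as a sliding window of recent outcomes that feeds the task cap. A minimal sketch, assuming a made-up window size and a linear error-rate-to-cap mapping (the 0.96 client only tracks an in-flight task count per server):

```python
from collections import deque

class ServerStats:
    """Per-server history of recent op outcomes. Illustrative sketch only."""

    def __init__(self, window=100):
        # Bounded history: True = success, False = error.
        self.recent = deque(maxlen=window)

    def record(self, ok):
        self.recent.append(ok)

    def error_rate(self):
        if not self.recent:
            return 0.0
        return self.recent.count(False) / len(self.recent)

    def task_cap(self, base_cap):
        # Shrink the per-server cap as the recent error rate grows,
        # but never below one task.
        return max(1, int(base_cap * (1.0 - self.error_rate())))
```

The same window could also drive the backoff duration, so time and task count both come from one history rather than from the retry counter alone.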
bq. ....For example, the new settings will make the client to send 4 queries in
1 second....
Yeah, that is not going to help anyone.
bq. If we want to compare 0.94 and 0.96, may be we should use the same
settings, i.e. pause: 1000ms backoff: { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64 }
hbase.client.max.perserver.tasks: 1
Seems like a good idea.
[~nkeywal] What do you think of the [~jeffreyz] patch?
[~jmspaggi] Any luck running the perf test?
We got our big cluster back so we'll start in on this one again.
With a single client and many regions, I see the client threads blocked waiting
to do locateRegionInMeta (I don't understand this regionLockObject... it locks
everyone out while a lookup is going on, rather than having only the threads
contending on the same region location wait). If there are few regions, we are
doing softvaluemap operations all the time.
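The locking alternative hinted at above would stripe the lock per region, so a meta lookup for one region never blocks lookups for others. A sketch under assumed names (not the actual regionLockObject code):

```python
import threading

class RegionLocationCache:
    """Per-region lock striping for location lookups. Illustrative sketch:
    only threads after the SAME region wait on a lookup in flight."""

    def __init__(self):
        self.cache = {}   # region -> location
        self.locks = {}   # region -> lock guarding that region's lookup
        self.guard = threading.Lock()  # guards the locks map itself

    def _lock_for(self, region):
        with self.guard:
            return self.locks.setdefault(region, threading.Lock())

    def locate(self, region, do_meta_lookup):
        # Fast path: no locking at all on a cache hit.
        loc = self.cache.get(region)
        if loc is not None:
            return loc
        with self._lock_for(region):
            # Double-check: another thread may have finished the lookup
            # for this region while we waited on its lock.
            loc = self.cache.get(region)
            if loc is None:
                loc = do_meta_lookup(region)
                self.cache[region] = loc
            return loc
```

In Java the same shape falls out of `ConcurrentHashMap.computeIfAbsent`, which serializes computation per key rather than globally.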
> Client write path perf issues
> -----------------------------
>
> Key: HBASE-9775
> URL: https://issues.apache.org/jira/browse/HBASE-9775
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 0.96.0
> Reporter: Elliott Clark
> Priority: Critical
> Attachments: 9775.rig.txt, 9775.rig.v2.patch, 9775.rig.v3.patch,
> Charts Search Cloudera Manager - ITBLL.png, Charts Search Cloudera
> Manager.png, hbase-9775.patch, job_run.log, short_ycsb.png, ycsb.png,
> ycsb_insert_94_vs_96.png
>
>
> Testing on larger clusters has not had the desired throughput increases.
--
This message was sent by Atlassian JIRA
(v6.1#6144)