[ 
https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815493#comment-13815493
 ] 

stack commented on HBASE-9775:
------------------------------

Back to the root discussion on this issue:

bq. with a max.total.tasks of 100 and max.perserver.tasks of 5, the client 
might not use all the server. May be a default of 2 for max.perserver.tasks 
would be better

That'll work if there are many servers, but it will be a constraint if there 
are only a few servers and a few clients: we will schedule at most two tasks to 
each server when it could take many more.

Ideally we want something like what you had before -- 5 or 1/2 the CPUs on the 
local server (as a guesstimate of how many CPUs the server has), whichever is 
greater -- and then, as soon as we get indications that the server is 
struggling, go down from this max per server and slowly ramp back up as we have 
successful ops against said server. (How drastic the drop in tasks-per-server 
should be would depend on the exception we got from the server.)
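
A rough sketch of the ramp-down/ramp-up idea above, in case it helps the 
discussion. Everything here is made up for illustration (the class name, the 
method names, the "severity" knob); none of it is existing HBase client API, 
and the one-slot-per-success ramp is just one possible policy.

```java
// Sketch only: hypothetical names, not HBase API.
final class AdaptiveServerLimit {
    private final int max;   // ceiling: max(5, half the local CPUs)
    private int current;     // tasks currently allowed against this server

    AdaptiveServerLimit(int localCpus) {
        // Local CPU count stands in as a guess for the server's CPU count.
        this.max = Math.max(5, localCpus / 2);
        this.current = max;
    }

    /**
     * The server looks like it is struggling: cut the limit.
     * severity is clamped to [0, 1]; 1.0 drops us to a single task.
     * The caller would pick severity based on the exception it got.
     */
    synchronized void onFailure(double severity) {
        double s = Math.min(1.0, Math.max(0.0, severity));
        int cut = (int) Math.ceil(current * s);
        current = Math.max(1, current - cut);
    }

    /** A successful op: ramp back up slowly, one slot at a time. */
    synchronized void onSuccess() {
        if (current < max) {
            current++;
        }
    }

    synchronized int current() {
        return current;
    }

    int max() {
        return max;
    }
}
```

On an 8-CPU client this starts at 5 tasks per server; a failure with severity 
0.5 drops it to 2, and each later success adds one back until it reaches 5 
again.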

bq. the server reject the client when it's busy (HBASE-9467). That increases 
the number of retries to do, and, on an heavy load, can lead us to fail on 
something that would have worked before.

We only reject as 'busy' when we can't obtain the lock after an amount of time 
and if we are trying to flush because we are up against the global mem limit.  
Regarding retries, if we get one of these RegionTooBusyExceptions, rather than 
back off for 100ms or so, should we back off more (an Elliott suggestion), and 
also drop the number of tasks we throw at this server at any one time?  It'd be 
hard to do as things are now, given that backoff is calculated based on retry 
count only.

Given the two items above, should we keep more stats per server than just a 
count of tasks?  Should we keep a history of success/error and do backoffs -- 
both the amount of time and how many tasks to send the server -- based on this?
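
To make the question concrete, here is a minimal sketch of per-server stats 
that drive both the pause and the task count. The class, the sliding-window 
size, and the linear scaling are all assumptions for illustration, not 
anything in the client today.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: hypothetical names, not HBase API.
final class ServerStats {
    private final int window;                       // how many recent ops we remember
    private final Deque<Boolean> outcomes = new ArrayDeque<>();
    private int errors;                             // failures currently in the window

    ServerStats(int window) {
        this.window = window;
    }

    /** Record the outcome of one op against this server. */
    synchronized void record(boolean success) {
        if (outcomes.size() == window) {
            // Evict the oldest outcome; if it was a failure, forget it.
            if (!outcomes.removeFirst()) {
                errors--;
            }
        }
        outcomes.addLast(success);
        if (!success) {
            errors++;
        }
    }

    synchronized double errorRate() {
        return outcomes.isEmpty() ? 0.0 : (double) errors / outcomes.size();
    }

    /** Pause grows linearly from basePauseMs (no errors) to basePauseMs * maxFactor (all errors). */
    synchronized long pauseMillis(long basePauseMs, int maxFactor) {
        return Math.round(basePauseMs * (1 + errorRate() * (maxFactor - 1)));
    }

    /** Task limit shrinks linearly with the error rate, never below one. */
    synchronized int taskLimit(int maxTasks) {
        return Math.max(1, (int) Math.floor(maxTasks * (1 - errorRate())));
    }
}
```

With a window of 4 and two failures in it, a 100ms base pause and a max factor 
of 10 gives a 550ms pause, and a max of 5 tasks becomes 2. Once the failures 
age out of the window, both go back to their defaults.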

bq. ....For example, the new settings will make the client to send 4 queries in 
1 second....

Yeah, that is not going to help anyone.

bq. If we want to compare 0.94 and 0.96, may be we should use the same 
settings, i.e. pause: 1000ms backoff: { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64 } 
hbase.client.max.perserver.tasks: 1

Seems like good idea.
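
For reference, the arithmetic behind those settings, as a small sketch. The 
multiplier table and the 1000ms pause are the ones quoted above; the class and 
method names are hypothetical, and the sketch ignores any jitter the real 
client may add to the pause.

```java
// Sketch only: hypothetical names; the table is the one quoted in this comment.
final class BackoffSchedule {
    static final int[] BACKOFF = {1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64};

    /** Pause before retry number tries (0-based); past the table, reuse the last multiplier. */
    static long pauseMillis(long pauseMs, int tries) {
        int i = Math.min(Math.max(tries, 0), BACKOFF.length - 1);
        return pauseMs * BACKOFF[i];
    }

    /** Total time spent sleeping across the first retries retries. */
    static long cumulativeMillis(long pauseMs, int retries) {
        long total = 0;
        for (int t = 0; t < retries; t++) {
            total += pauseMillis(pauseMs, t);
        }
        return total;
    }
}
```

With pause = 1000ms the first three retries each wait 1s, the fourth waits 2s, 
and walking the whole 11-entry table adds up to 135s of sleeping.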

[~nkeywal] What do you think of the [~jeffreyz] patch?

[~jmspaggi] Any luck running the perf test?

We got our big cluster back so we'll start in on this one again.

With a single client and many regions, I see the client threads blocked waiting 
to do locateRegionInMeta (I don't understand this regionLockObject... it locks 
everyone out while a lookup is going on, rather than having threads contend 
only on the same region location).  If there are few regions, we are doing 
softvaluemap operations all the time.
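
To illustrate the alternative to one global lock, here is a rough sketch where 
only threads asking for the same key wait on each other. This is not how the 
client code is written; the class, the String keys, and the lookup function 
are stand-ins, and the lock map is never cleaned up here, which a real version 
would have to deal with.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Sketch only: hypothetical names, not the HBase client's locator.
final class PerKeyLocator {
    private final ConcurrentMap<String, Object> locks = new ConcurrentHashMap<>();
    private final ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> lookup;  // stand-in for the meta lookup; must not return null

    PerKeyLocator(Function<String, String> lookup) {
        this.lookup = lookup;
    }

    String locate(String regionKey) {
        // Fast path: cached locations need no lock at all.
        String loc = cache.get(regionKey);
        if (loc != null) {
            return loc;
        }
        // One lock object per key, so lookups for different keys do not block each other.
        Object lock = locks.computeIfAbsent(regionKey, k -> new Object());
        synchronized (lock) {
            // Re-check: another thread may have filled the cache while we waited.
            loc = cache.get(regionKey);
            if (loc == null) {
                loc = lookup.apply(regionKey);
                cache.put(regionKey, loc);
            }
            return loc;
        }
    }
}
```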








> Client write path perf issues
> -----------------------------
>
>                 Key: HBASE-9775
>                 URL: https://issues.apache.org/jira/browse/HBASE-9775
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 0.96.0
>            Reporter: Elliott Clark
>            Priority: Critical
>         Attachments: 9775.rig.txt, 9775.rig.v2.patch, 9775.rig.v3.patch, 
> Charts Search   Cloudera Manager - ITBLL.png, Charts Search   Cloudera 
> Manager.png, hbase-9775.patch, job_run.log, short_ycsb.png, ycsb.png, 
> ycsb_insert_94_vs_96.png
>
>
> Testing on larger clusters has not had the desired throughput increases.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
