[ https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13815493#comment-13815493 ]
stack commented on HBASE-9775:
------------------------------

Back to the root discussion on this issue:

bq. with a max.total.tasks of 100 and max.perserver.tasks of 5, the client might not use all the servers. Maybe a default of 2 for max.perserver.tasks would be better

That will work if there are many servers, but it will be a constraint if there are only a few servers and a few clients. In that case we will schedule at most two tasks to each server when it could take many more. Ideally we want something like what you had before -- 5, or half the CPUs on the local machine as a guesstimate of how many CPUs the server has, whichever is greater -- and then, as soon as we get indications that a server is struggling, drop down from this max-per-server and slowly ramp back up as we have successful ops against said server (how drastic the drop in tasks-per-server should be would depend on the exception we got from the server).

bq. the server rejects the client when it's busy (HBASE-9467). That increases the number of retries to do, and, on a heavy load, can lead us to fail on something that would have worked before.

We only reject as 'busy' when we can't obtain the lock after an amount of time and we are trying to flush because we are up against the global memstore limit. Regarding retries: if we get one of these RegionTooBusyExceptions, rather than back off for 100ms or so, should we back off more (an Elliott suggestion)? And drop the number of tasks to throw at this server at any one time. That would be hard to do as things are now, given backoff is calculated from the retry count only.

Given the two items above, should we keep more stats per server than just the count of tasks? We could keep a history of success/error and base backoffs -- both the amount of time and how many tasks to send the server -- on that history.

bq. ....For example, the new settings will make the client to send 4 queries in 1 second....

Yeah, that is not going to help anyone.
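The adaptive per-server scheduling and retry backoff discussed above could look something like the sketch below. To be clear, this is not HBase's actual AsyncProcess code; the class and method names (PerServerTaskLimit, onServerBusy, etc.) and the halving/ramp-up policy are hypothetical illustrations, though the multiplier table mirrors the backoff series quoted in this thread.

```java
/**
 * Hypothetical sketch (not HBase's actual client code): an adaptive
 * per-server task cap plus a table-driven retry backoff. The cap is
 * halved when the server signals it is busy (e.g. a RegionTooBusyException)
 * and ramped back up one slot per streak of successful operations.
 */
public class PerServerTaskLimit {

  // Backoff multipliers as quoted in this thread; sleep = pause * multiplier.
  private static final int[] BACKOFF = {1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64};

  private final int ceiling;   // hard max tasks per server
  private int current;         // current adaptive cap
  private int successStreak;   // consecutive successes since last change

  public PerServerTaskLimit(int ceiling) {
    this.ceiling = ceiling;
    this.current = ceiling;
  }

  /** Server told us it is struggling: drop the cap drastically (halve it). */
  public void onServerBusy() {
    current = Math.max(1, current / 2);
    successStreak = 0;
  }

  /** Successful op: after a streak of 10, ramp the cap back up by one. */
  public void onSuccess() {
    if (++successStreak >= 10 && current < ceiling) {
      current++;
      successStreak = 0;
    }
  }

  /** How many tasks we should have in flight against this server right now. */
  public int maxTasks() {
    return current;
  }

  /** Backoff time for the given retry number, from the multiplier table. */
  public static long backoffMillis(long pauseMs, int retry) {
    int idx = Math.min(retry, BACKOFF.length - 1);
    return pauseMs * BACKOFF[idx];
  }
}
```

With pause = 1000ms this reproduces the quoted schedule (1s, 1s, 1s, 2s, ... 64s); the busy-signal handler is where a RegionTooBusyException could shrink the per-server cap instead of only bumping the retry counter.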
bq. If we want to compare 0.94 and 0.96, maybe we should use the same settings, i.e. pause: 1000ms, backoff: { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64 }, hbase.client.max.perserver.tasks: 1

Seems like a good idea.

[~nkeywal] What do you think of the [~jeffreyz] patch?

[~jmspaggi] Any luck running the perf test?

We got our big cluster back, so we'll start in on this one again. With a single client and many regions, I see the client threads blocked waiting to do locateRegionInMeta (I don't understand this regionLockObject... it locks everyone out while a lookup is going on, rather than having only threads after the same region location contend). If there are few regions, we are doing softvaluemap operations all the time.

> Client write path perf issues
> -----------------------------
>
>                 Key: HBASE-9775
>                 URL: https://issues.apache.org/jira/browse/HBASE-9775
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 0.96.0
>            Reporter: Elliott Clark
>            Priority: Critical
>         Attachments: 9775.rig.txt, 9775.rig.v2.patch, 9775.rig.v3.patch, Charts Search Cloudera Manager - ITBLL.png, Charts Search Cloudera Manager.png, hbase-9775.patch, job_run.log, short_ycsb.png, ycsb.png, ycsb_insert_94_vs_96.png
>
> Testing on larger clusters has not had the desired throughput increases.

--
This message was sent by Atlassian JIRA
(v6.1#6144)