[
https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797182#comment-13797182
]
Nicolas Liochon commented on HBASE-9775:
----------------------------------------
Thanks, Elliott.
bq. So there should be 2 clients per region server.
That's something that would work fine with 0.94 out of the box, right?
Is there anything on the server that could explain the server timeout
(SocketTimeoutException)?
With 150 clients, and each client able to send 2 queries per server, a server
can receive 300 queries simultaneously.
On average it should be less: a client can have only 100 tasks in flight, so it
will be about 200 (but that's an average: an unlucky server can still receive
all 300 requests). The limit on the number of threads doesn't come into play
here: there should be fewer than 250 threads per client.
Here are the differences I see between 0.94 and 0.96 that could be related.
I may be wrong; I'm not sure about all the backports.
- with the settings above, a server would have received 150 queries max (1 per
client), instead of the 300 worst case / 150 average we have now.
- the server rejects the client when it's busy (HBASE-9467). That increases the
number of retries to do and, under heavy load, can lead us to fail on
something that would have worked before.
- we're much more aggressive on the time before retrying (100ms vs. 1000ms), and
the backoff table is different. It was { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64 },
it's now { 1, 2, 3, 5, 10, 100 }. The number of retries was 10, it's now 31. But
we increase the server load as we're retrying more aggressively: for example,
the new settings make the client send 4 queries within the first second when
they fail. If the servers can handle the load, it's great. With 150 clients
like this, maybe not.
- we now stop after ~5 minutes (calculated from the number of retries & backoff
times), whatever the number of retries actually made. I'm not sure that's the
point here (I would need the debug logs to know), but I've seen it with these
tests on other clusters (we were not doing all the retries). A sketch of that
calculation is after this list.
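Here is a minimal sketch of how those retry budgets add up, assuming the sleep
before retry i is pause * backoff[min(i, backoff.length - 1)] (which is how I
understand the client computes it), and ignoring jitter and the time spent in
the calls themselves.
{code:java}
public class RetryBudget {
  // Sum the sleeps over 'retries' attempts for a given pause and backoff table,
  // assuming sleep(i) = pause * backoff[min(i, backoff.length - 1)].
  static long totalSleepMs(long pauseMs, int[] backoff, int retries) {
    long total = 0;
    for (int i = 0; i < retries; i++) {
      total += pauseMs * backoff[Math.min(i, backoff.length - 1)];
    }
    return total;
  }

  public static void main(String[] args) {
    int[] backoff094 = { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64 };
    int[] backoff096 = { 1, 2, 3, 5, 10, 100 };

    // 0.94 settings: pause 1000ms, 10 retries -> ~71 seconds of backoff in total.
    System.out.println("0.94: " + totalSleepMs(1000, backoff094, 10) / 1000 + "s");
    // 0.96 settings: pause 100ms, 31 retries -> ~262 seconds, i.e. roughly the
    // ~5 minutes mentioned above.
    System.out.println("0.96: " + totalSleepMs(100, backoff096, 31) / 1000 + "s");
  }
}
{code}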
Is there anything that I forgot?
If we want to compare 0.94 and 0.96, maybe we should use the same settings,
i.e.
pause: 1000ms
backoff: { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64 }
hbase.client.max.perserver.tasks: 1
This does not match exactly (0.96 will still send more tasks at peaks, as it
always sends data to all the servers in parallel for example, and there is
still the time limit and the effect of HBASE-9467 that make me more comfortable
with more retries), but hopefully we're not too far off. We can use
hbase.client.max.total.tasks if we need to control the clients more.
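As an illustration, a client could apply most of these settings as in the
sketch below. It's only a sketch: as far as I know the backoff table itself is
a hardcoded constant, not a configuration property, so that part would still
differ without a code change.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class CompareSettings {
  public static Configuration make094LikeConf() {
    Configuration conf = HBaseConfiguration.create();
    // 0.94-like pause and retry count
    conf.setLong("hbase.client.pause", 1000);
    conf.setInt("hbase.client.retries.number", 10);
    // one task per server per client, as proposed for the comparison above
    conf.setInt("hbase.client.max.perserver.tasks", 1);
    // optionally tighten the global limit as well, if we need to control the clients more
    // conf.setInt("hbase.client.max.total.tasks", 100);
    return conf;
  }
}
{code}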
I'm not sure it should be the default (at least for the backoff; the strategy
there was to improve latency at the cost of some server load). But it could be
recommended for upgrades and/or map reduce tasks.
Lastly, what's the configuration of the box?
> Client write path perf issues
> -----------------------------
>
> Key: HBASE-9775
> URL: https://issues.apache.org/jira/browse/HBASE-9775
> Project: HBase
> Issue Type: Bug
> Components: Client
> Affects Versions: 0.96.0
> Reporter: Elliott Clark
> Priority: Critical
> Attachments: Charts Search Cloudera Manager - ITBLL.png, Charts
> Search Cloudera Manager.png, job_run.log, short_ycsb.png,
> ycsb_insert_94_vs_96.png
>
>
> Testing on larger clusters has not had the desired throughput increases.