Andrew Purtell created HBASE-11306:
--------------------------------------
Summary: Client connection starvation issues under high load
Key: HBASE-11306
URL: https://issues.apache.org/jira/browse/HBASE-11306
Project: HBase
Issue Type: Bug
Environment: Amazon EC2
Reporter: Andrew Purtell
I am using YCSB 0.1.4 with Hadoop 2.2.0 and HBase 0.98.3 RC2 on an EC2 testbed
(c3.8xlarge instances, SSD backed, 10 GigE networking). There are five slaves
and five separate clients. I start with a prepopulated table of 100M rows over
~20 regions and run 5 YCSB clients concurrently targeting 250,000 ops/sec in
aggregate, running workload A. (The problem can also be reproduced, though less
readily, at 100k ops/sec aggregate.) Due to how I set up the test, the data is
all in one HFile
per region and very likely in cache. All writes will fit in the aggregate
memstore. No flushes or compactions are observed on any server during the test,
only the occasional log roll. Despite these favorable conditions developed over
time to isolate this issue, a few of the clients will stop making progress
until socket timeouts after 60 seconds, leading to very large op latency
outliers. With the above detail plus some added extra logging we can rule out
storage layer effects. Turning to the network, this is where things get
interesting.
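For reference, each client is driven with an invocation roughly like the
following. The column family name is a placeholder; {{-threads}} and
{{-target}} are standard YCSB flags, and the per-client target is the 5-client
share of the 250k ops/sec aggregate:
{noformat}
# Approximate per-client invocation: 5 clients x 50k = 250k ops/sec aggregate.
bin/ycsb run hbase -P workloads/workloada \
  -p columnfamily=family \
  -threads 20 \
  -target 50000
{noformat}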
To watch receive and send socket queues and TCP level timers on all of the
clients and servers simultaneously during the run, I used the following loop
(8120 is the configured RS data port):
{noformat}
while true ; do clear ; ss -a -o | grep ESTAB | grep 8120 ; sleep 5 ; done
{noformat}
I have Nagle disabled on the clients and servers and JVM networking set up to
use IPv4 only. The YCSB clients are configured to use 20 threads. These threads
are expected to share 5 active connections, one to each RegionServer. When the
test starts we see exactly what we'd expect: 5 established TCPv4 connections.
On all servers the recv and send queues were usually empty when sampled. I
never saw more than 10K waiting. The servers occasionally retransmitted, but
with timers ~200ms and retry counts ~0.
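For completeness, this is how that part of the setup is configured; the
property names below are the standard knobs (shown as a sketch, not an exact
dump of my test config):
{noformat}
# hbase-site.xml on clients and servers: disable Nagle on both RPC endpoints
#   hbase.ipc.client.tcpnodelay = true
#   hbase.ipc.server.tcpnodelay = true
# JVM flag for IPv4-only networking:
#   -Djava.net.preferIPv4Stack=true
{noformat}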
The client side is another story. We see serious problems like:
{noformat}
tcp ESTAB 0 8733 10.220.15.45:41428 10.220.2.115:8120
timer:(on,38sec,7)
{noformat}
That is about 9K of data still waiting to be sent after 7 TCP level
retransmissions.
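The timer value is consistent with standard exponential RTO backoff. Assuming
an initial RTO of roughly 300ms (plausible, given the ~200ms timers sampled on
the servers), the pending timer after 7 retransmissions would be:
{noformat}
RTO after n retransmissions ~= initial_RTO * 2^n
0.3s * 2^7 = 38.4s    (matches the observed timer:(on,38sec,7))
{noformat}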
There is some unfair queueing happening at the network level, but we should be
handling this better.
During the periods when YCSB is not making progress, there is only that one
connection to one RS in established state. There should be 5 established
connections, one to each RS, but the other 4 have been dropped somehow. The one
distressed connection remains established for the duration of the problem,
while the retransmission timer count on the connection ticks upward. It is
dropped once the socket times out at the app level. Why are the connections to
the other RegionServers dropped? Why are all threads blocked waiting on the one
connection for the socket timeout interval (60 seconds)? After the socket
timeout we see the stuck connection dropped and 5 new connections immediately
established. YCSB doesn't do anything that would lead to this behavior: it uses
separate HTable instances for each client thread and does not close the table
references until test cleanup. These behaviors seem internal to the HBase
client.
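To illustrate why separate HTable instances don't help here, a minimal sketch
of the usage pattern, assuming the 0.98 client API ("usertable" is YCSB's
default table name): HTable instances created from the same Configuration
share a single cached HConnection internally, so all 20 threads still funnel
through one socket per RegionServer.
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;

public class SharedConnectionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    for (int i = 0; i < 20; i++) {
      // Each thread gets its own HTable, as YCSB does, but every instance
      // created from the same Configuration shares one HConnection, and so
      // one multiplexed TCP connection per RegionServer.
      final HTable table = new HTable(conf, "usertable");
      new Thread(new Runnable() {
        @Override
        public void run() {
          try {
            // All 20 threads' RPCs ride the same five sockets.
            table.get(new Get("user1".getBytes()));
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      }).start();
    }
  }
}
{noformat}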
Is maintaining only a single multiplexed connection to each RegionServer the
best approach?
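Worth noting while we consider alternatives: if I recall correctly, the client
already has a little-used option from HBASE-2939 to pool multiple connections
per RegionServer. A sketch, assuming the property names and value strings
haven't changed:
{noformat}
<!-- hbase-site.xml on the clients: pool several connections per
     RegionServer instead of multiplexing over a single one -->
<property>
  <name>hbase.client.ipc.pool.type</name>
  <value>RoundRobin</value>
</property>
<property>
  <name>hbase.client.ipc.pool.size</name>
  <value>5</value>
</property>
{noformat}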
A related issue is we collect zombie sockets in ESTABLISHED state on the
server. Also likely not our fault per se. Keepalives are enabled, so the OS
will eventually reap them, but on Linux the default keepalive idle time is 2
hours. We might want to drop connections where we don't see activity sooner
than that. Before HBASE-11277 we were spinning indefinitely on a core for each
connection in this state.
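For reference, the stock Linux keepalive tunables behind that 2 hour figure:
{noformat}
# With defaults, a dead peer holds an ESTABLISHED socket for
# 7200s idle + 9 probes * 75s, a bit over 2 hours, before the OS reaps it.
sysctl net.ipv4.tcp_keepalive_time     # 7200 seconds idle before first probe
sysctl net.ipv4.tcp_keepalive_probes   # 9 unanswered probes
sysctl net.ipv4.tcp_keepalive_intvl    # 75 seconds between probes
{noformat}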
I have tried this with a range of recent Java 7 and Java 8 runtimes, and with
several separately launched EC2 based test clusters, all producing the same
results, so this is not specific to any one JVM or cluster.