[
https://issues.apache.org/jira/browse/HBASE-11306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Kyle Purtell resolved HBASE-11306.
-----------------------------------------
Resolution: Invalid
This has been invalidated by a lot of subsequent progress. The investigation will need to be redone at some point.
> Client connection starvation issues under high load on Amazon EC2
> -----------------------------------------------------------------
>
> Key: HBASE-11306
> URL: https://issues.apache.org/jira/browse/HBASE-11306
> Project: HBase
> Issue Type: Bug
> Environment: Amazon EC2
> Reporter: Andrew Kyle Purtell
> Priority: Major
> Attachments: hbase11306-0.98.3RC2.patch
>
>
> I am using YCSB 0.1.4 with Hadoop 2.2.0 and HBase 0.98.3 RC2 on an EC2
> testbed (c3.8xlarge instances, SSD backed, 10 GigE networking). There are
> five slaves and five separate clients. I start with a prepopulated table of
> 100M rows over ~20 regions and run 5 YCSB clients concurrently targeting
> 250,000 ops/sec in aggregate. (I can reproduce this, less readily, at
> 100k ops/sec aggregate as well.) Workload A. Due to how I set up the test, the
> data is all in one HFile per region and very likely in cache. All writes will
> fit in the aggregate memstore. No flushes or compactions are observed on any
> server during the test, only the occasional log roll. Despite these favorable
> conditions, developed over time to isolate this issue, a few of the clients
> will stop making progress until socket timeouts fire after 60 seconds, leading to
> very large op latency outliers. With the above details plus some extra
> logging we can rule out storage layer effects. Turning to the network, this
> is where things get interesting.
> I used {{while true ; do clear ; ss -a -o|grep ESTAB|grep 8120 ; sleep 5 ;
> done}} (8120 is the configured RS data port) to watch receive and send socket
> queues and TCP level timers on all of the clients and servers simultaneously
> during the run.
> I have Nagle disabled on the clients and servers and JVM networking set up to
> use IPv4 only. The YCSB clients are configured to use 20 threads. These
> threads are expected to share 5 active connections, one to each RegionServer.
> When the test starts we see exactly what we'd expect, 5 established TCPv4
> connections.
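> For reference, here is a minimal sketch of my reading of why 20 threads end up on
> 5 sockets: HTable instances created from the same Configuration share one managed
> HConnection, which multiplexes all RPCs to a given RegionServer over a single TCP
> connection. The table name "usertable" and the thread count are just placeholders.
> {noformat}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.hbase.HBaseConfiguration;
> import org.apache.hadoop.hbase.TableName;
> import org.apache.hadoop.hbase.client.Get;
> import org.apache.hadoop.hbase.client.HTable;
> import org.apache.hadoop.hbase.util.Bytes;
>
> public class SharedConnectionSketch {
>   public static void main(String[] args) {
>     final Configuration conf = HBaseConfiguration.create();
>     for (int i = 0; i < 20; i++) {               // 20 YCSB-style client threads
>       new Thread(new Runnable() {
>         public void run() {
>           try {
>             // Each thread has its own HTable, as YCSB does, but the underlying
>             // HConnection (and thus the per-RegionServer socket) is shared.
>             HTable table = new HTable(conf, TableName.valueOf("usertable"));
>             table.get(new Get(Bytes.toBytes("user1")));  // rides the shared per-RS socket
>             table.close();  // refcounted; does not tear down the shared HConnection
>           } catch (Exception e) {
>             e.printStackTrace();
>           }
>         }
>       }).start();
>     }
>   }
> }
> {noformat}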
> On all servers the recv and send queues were usually empty when sampled; I
> never saw more than 10K waiting. The servers occasionally retransmitted, but
> with timers ~200ms and retry counts ~0.
> The client side is another story. We see serious problems like:
> {noformat}
> tcp ESTAB 0 8733 10.220.15.45:41428 10.220.2.115:8120
> timer:(on,38sec,7)
> {noformat}
> That is about 9K of data still waiting to be sent after 7 TCP level
> retransmissions.
> There is some unfair queueing and packet drops happening at the network
> level, but we should be handling this better.
> During the periods when YCSB is not making progress, there is only that one
> connection to one RS in established state. There should be 5 established
> connections, one to each RS, but the other 4 have been dropped somehow. The
> one distressed connection remains established for the duration of the
> problem, while the retransmission timer count on the connection ticks upward.
> It is dropped once the socket times out at the app level. Why are the
> connections to the other RegionServers dropped? Why are all threads blocked
> waiting on the one connection for the socket timeout interval (60 seconds)?
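> For what it's worth, the 60 second figure lines up with the default RPC timeout. A
> minimal client-side fragment, assuming hbase.rpc.timeout (default 60000 ms) is the
> knob behind this socket timeout; lowering it would bound the stall but not fix the
> root cause:
> {noformat}
> Configuration conf = HBaseConfiguration.create();
> conf.setInt("hbase.rpc.timeout", 10000);  // fail the stuck call after 10s instead of 60s
> HTable table = new HTable(conf, TableName.valueOf("usertable"));  // placeholder table name
> {noformat}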
> After the socket timeout we see the stuck connection dropped and 5 new
> connections immediately established. YCSB doesn't do anything that would lead
> to this behavior; it uses separate HTable instances for each client
> thread and does not close the table references until test cleanup. These
> behaviors seem internal to the HBase client.
> Is maintaining only a single multiplexed connection to each RegionServer the
> best approach?
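> One partial mitigation might be the client-side connection pooling added in
> HBASE-2939, assuming the 0.98 RPC client still honors those properties for this
> path (I have not verified that). A fragment:
> {noformat}
> Configuration conf = HBaseConfiguration.create();
> conf.set("hbase.client.ipc.pool.type", "RoundRobin");  // assumed valid value; see PoolMap.PoolType
> conf.setInt("hbase.client.ipc.pool.size", 5);          // allow up to 5 sockets per RegionServer
> {noformat}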
> A related issue is that we collect zombie sockets in ESTABLISHED state on the
> server. This is also likely not our fault per se. Keepalives are enabled, so they will
> eventually be garbage collected by the OS. On Linux systems this will take 2
> hours. We might want to drop connections where we don't see activity sooner
> than that. Before HBASE-11277 we were spinning indefinitely on a core for
> each connection in this state.
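> For context on the 2 hour figure, a generic JDK fragment (not the HBase RPC code
> itself) showing what the enabled keepalives buy us, with the stock Linux timings
> noted in the comments:
> {noformat}
> // With SO_KEEPALIVE set, a dead peer left in ESTABLISHED state is only detected after
> // the kernel's keepalive cycle: on stock Linux, net.ipv4.tcp_keepalive_time=7200s, then
> // tcp_keepalive_probes=9 probes every tcp_keepalive_intvl=75s, i.e. roughly the 2 hours
> // noted above. An application-level idle timeout on the server would reap these sooner.
> java.net.Socket s = new java.net.Socket();  // unconnected socket, just to show the flag
> s.setKeepAlive(true);                       // SO_KEEPALIVE; the timings are kernel-wide sysctls
> {noformat}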
> I have tried this with a narrow range of recent Java 7 and Java 8 runtimes
> and with several separately launched EC2-based test clusters; all produce the
> same results, so this is a generic platform issue.