Yes, that is indeed the problem. It is caused by two factors: (1) HBase has a fixed number (30 by default) of RPC handlers, and (2) RPC handlers block on HDFS reads -- each a reasonable design choice on its own.
Under a sufficiently heavy I/O-intensive load, all RPC handlers become blocked on disk, and no progress can be made even for requests that require no I/O. However, increasing the number of threads seems to be an incomplete solution -- with an even heavier I/O-intensive load you run into the same problem again...

On Sat, Apr 1, 2017 at 3:47 PM, Enis Söztutar <[email protected]> wrote:

> I think the problem is that you ONLY have 30 "handler" threads
> (hbase.regionserver.handler.count). Handlers are the main thread pool that
> executes the RPC requests. When you do IO-bound requests, very likely
> all of the 30 threads are just blocked by the disk access, so that the
> total throughput drops.
>
> It is typical to run with 100-300 threads on the regionserver side,
> depending on your settings. You can use the "Debug dump" from the
> regionserver web UI or jstack to inspect what the "handler" threads are
> doing.
>
> Enis
>
> On Fri, Mar 31, 2017 at 7:57 PM, 杨苏立 Yang Su Li <[email protected]> wrote:
>
> > On Fri, Mar 31, 2017 at 9:39 PM, Ted Yu <[email protected]> wrote:
> >
> > > Can you tell us which release of hbase you used ?
> >
> > 2.0.0-SNAPSHOT
> >
> > > Please describe values for the config parameters in hbase-site.xml
> >
> > The content of hbase-site.xml is shown below, but this problem is in
> > fact not sensitive to configuration -- we can reproduce it with
> > different configurations, and across different hbase versions.
> >
> > > Do you have SSD(s) in your cluster ?
> > > If so and the mixed workload involves writes, have you taken a look at
> > > HBASE-12848 ?
> >
> > No, we don't use SSD (for hbase). And the workload does not involve
> > writes (though workloads with writes show similar behavior). As I
> > stated, both clients are doing 1KB Gets.
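[A minimal sketch of the starvation mechanism described above -- not HBase's actual RPC scheduler, just a fixed thread pool with a mix of blocking "disk" requests and instant "cache" requests. The pool size and sleep durations are illustrative stand-ins for hbase.regionserver.handler.count and an HDFS read.]

```python
import time
from concurrent.futures import ThreadPoolExecutor

HANDLERS = 4          # stand-in for hbase.regionserver.handler.count

def io_bound_get():
    time.sleep(0.2)   # simulated blocking HDFS read (cache miss)
    return "disk"

def cached_get():
    return "cache"    # served from block cache, needs no I/O

pool = ThreadPoolExecutor(max_workers=HANDLERS)

# Fill every handler with I/O-bound requests, then submit a cached request.
slow = [pool.submit(io_bound_get) for _ in range(HANDLERS)]
start = time.monotonic()
fast = pool.submit(cached_get)
fast.result()
waited = time.monotonic() - start
pool.shutdown(wait=True)

# The cached Get required no disk access, yet it could not run until a
# handler thread was freed by a slow disk read finishing.
print(f"cached Get waited {waited:.2f}s for a free handler")
```

This is why both clients' throughput collapses together: once client-1's cache misses occupy all handlers, client-2's cacheable Gets queue behind them, so the whole server runs at disk speed.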
> >
> > <configuration>
> >
> >   <property>
> >     <name>hbase-master</name>
> >     <value>node0.orighbasecluster.distsched-pg0.wisc.cloudlab.us:60000</value>
> >   </property>
> >
> >   <property>
> >     <name>hbase.rootdir</name>
> >     <value>hdfs://node0.orighbasecluster.distsched-pg0.wisc.cloudlab.us:9000/hbase</value>
> >   </property>
> >
> >   <property>
> >     <name>hbase.fs.tmp.dir</name>
> >     <value>hdfs://node0.orighbasecluster.distsched-pg0.wisc.cloudlab.us:9000/hbase-staging</value>
> >   </property>
> >
> >   <property>
> >     <name>hbase.cluster.distributed</name>
> >     <value>true</value>
> >   </property>
> >
> >   <property>
> >     <name>hbase.zookeeper.property.dataDir</name>
> >     <value>/tmp/zookeeper</value>
> >   </property>
> >
> >   <property>
> >     <name>hbase.zookeeper.property.clientPort</name>
> >     <value>2181</value>
> >   </property>
> >
> >   <property>
> >     <name>hbase.zookeeper.quorum</name>
> >     <value>node0.orighbasecluster.distsched-pg0.wisc.cloudlab.us</value>
> >   </property>
> >
> >   <property>
> >     <name>hbase.ipc.server.read.threadpool.size</name>
> >     <value>10</value>
> >   </property>
> >
> >   <property>
> >     <name>hbase.regionserver.handler.count</name>
> >     <value>30</value>
> >   </property>
> >
> > </configuration>
> >
> > > Cheers
> > >
> > > On Fri, Mar 31, 2017 at 7:29 PM, 杨苏立 Yang Su Li <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > We found that when there is a mix of CPU-intensive and I/O-intensive
> > > > workloads, HBase seems to slow everything down to the disk-throughput
> > > > level.
> > > >
> > > > This is shown in the performance graph at
> > > > http://pages.cs.wisc.edu/~suli/blocking-orig.pdf : both client-1 and
> > > > client-2 are issuing 1KB Gets. From second 0, both repeatedly access
> > > > a small set of data that is cacheable, and both get high throughput
> > > > (~45K ops/s).
> > > > At second 60, client-1 switches to an I/O-intensive workload and
> > > > begins to randomly access a large set of data (which does not fit in
> > > > the cache). *Both* client-1's and client-2's throughput drops to
> > > > ~0.5K ops/s.
> > > >
> > > > Is this acceptable behavior for HBase, or is it considered a bug or
> > > > a performance drawback? I can find an old JIRA entry about similar
> > > > problems (https://issues.apache.org/jira/browse/HBASE-8836), but it
> > > > was never resolved.
> > > >
> > > > Thanks.
> > > >
> > > > Suli

--
Suli Yang

Department of Physics
University of Wisconsin Madison

4257 Chamberlin Hall
Madison WI 53703
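[Editor's note: per Enis's advice above, the immediate mitigation is raising the handler count in hbase-site.xml. A sketch of the change, assuming the same config file quoted earlier; the value 100 is illustrative -- the right number depends on the workload, within the 100-300 range he mentions:]

```xml
<!-- Illustrative: replaces the existing value of 30. A larger pool delays
     handler exhaustion under I/O-bound load but, as noted in the thread,
     does not eliminate it. -->
<property>
  <name>hbase.regionserver.handler.count</name>
  <value>100</value>
</property>
```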
