Bryan Beaudreault created HBASE-26739:
-----------------------------------------

             Summary: Performance regression in asyncwal 
                 Key: HBASE-26739
                 URL: https://issues.apache.org/jira/browse/HBASE-26739
             Project: HBase
          Issue Type: Bug
    Affects Versions: 2.4.6
            Reporter: Bryan Beaudreault


I've been load testing hbase2 with hadoop 3.3.1 (client and server) and 
comparing the results to an identical cluster running hbase/hadoop cdh5.16.2 
(hbase ~1.2.0, hadoop ~2.6.0). With a heavy write workload I consistently 
see 99th percentile sync times of 5-6ms in cdh5, but 20-30ms in hbase2. 
Configuring hbase2 to use the 'filesystem' wal provider mostly resolves this 
regression, bringing latencies down to 6-8ms. That is still a little slower 
than cdh5, but much more reasonable.
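
For reference, the provider switch described above is a single property in hbase-site.xml. This is a sketch of the fragment, assuming the 2.x default of asyncfs is otherwise in effect and that the region servers are restarted to pick up the change:

```xml
<!-- hbase-site.xml: use the FSHLog-based WAL instead of the async WAL -->
<property>
  <name>hbase.wal.provider</name>
  <value>filesystem</value>
</property>
```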

Unfortunately I don't have much flexibility in my environment to try 
various versions of hadoop. I didn't notice any exceptions or anything else 
to indicate an API problem with hadoop 3.3.1, so this may just be a general 
regression in the 2.x branch.

I briefly tried profiling wall clock time and the process was dominated by 
waiting in SyncFuture.get. I haven't dug deep enough into the code yet to know 
how to identify a bottleneck in whatever threads are responsible for completing 
those futures.
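
One possible way to narrow this down is to take repeated thread snapshots and separate the threads parked in the waiting frame from everything else, then look at what the remaining threads are doing. The following is a minimal Java sketch of that filtering step, not HBase code: the class BlockedFrameSampler and its nested SyncFuture are hypothetical stand-ins, and the snapshot only covers the JVM it runs in. Against a real region server the same filter would have to be applied to thread dumps taken from that process.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CountDownLatch;

public class BlockedFrameSampler {

    /** True if any frame's class name ends with classSuffix and its method equals method. */
    static boolean stackContains(StackTraceElement[] stack, String classSuffix, String method) {
        for (StackTraceElement frame : stack) {
            if (frame.getClassName().endsWith(classSuffix)
                    && frame.getMethodName().equals(method)) {
                return true;
            }
        }
        return false;
    }

    /** One snapshot of this JVM: names of threads currently inside classSuffix.method. */
    static List<String> threadsIn(String classSuffix, String method) {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            if (stackContains(e.getValue(), classSuffix, method)) {
                names.add(e.getKey().getName());
            }
        }
        return names;
    }

    /** Hypothetical stand-in for a blocking future; NOT the HBase class of the same name. */
    static class SyncFuture {
        void get(CountDownLatch done) {
            try {
                done.await(); // parks the caller until the latch is released
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        Thread waiter = new Thread(() -> new SyncFuture().get(done), "demo-handler");
        waiter.setDaemon(true);
        waiter.start();

        // Wait until the demo thread is actually parked before sampling.
        while (waiter.getState() != Thread.State.WAITING) {
            Thread.sleep(10);
        }

        System.out.println("threads in SyncFuture.get: " + threadsIn("SyncFuture", "get"));
        done.countDown(); // release the waiter
    }
}
```

Matching on a class-name suffix keeps the sketch independent of the package name. Inverting the filter (threads that are not in the waiting frame) is the more interesting view here, since those are the candidates for whatever is slow to complete the futures.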

One thing to note: I tried enabling hbase.wal.async.use-shared-event-loop but 
noticed no difference.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
