I went out on vacation (and am still out) before tracking this down. If you are waiting for me to make more progress with HBASE-24637, I can do that in a couple of weeks. Anyone is welcome to step in sooner.
On Tue, Jul 14, 2020 at 11:38 AM Josh Elser <els...@apache.org> wrote:
> Wow. Great stuff, Andrew!
>
> Thank you for compiling and posting it all here. I can only imagine how time-consuming this was.
>
> On 6/26/20 1:57 PM, Andrew Purtell wrote:
> > Hey Anoop, I opened https://issues.apache.org/jira/browse/HBASE-24637 and attached the patches and script used to make the comparison.
> >
> > On Fri, Jun 26, 2020 at 2:33 AM Anoop John <anoop.hb...@gmail.com> wrote:
> >> Great investigation, Andy. Do you know of any JIRAs which made changes in SQM? It would be great if you could attach your patch which tracks the scan flow. If we have a JIRA for this issue, can you please attach it there?
> >>
> >> Anoop
> >>
> >> On Fri, Jun 26, 2020 at 1:56 AM Andrew Purtell <andrew.purt...@gmail.com> wrote:
> >>> Related, I think I found a bug in branch-1 where we don't heartbeat in the filter-all case until we switch store files, so scanning a very large store file might time out with client defaults. Remarking on this here so I don't forget to follow up.
> >>>
> >>>> On Jun 25, 2020, at 12:27 PM, Andrew Purtell <apurt...@apache.org> wrote:
> >>>>
> >>>> I repeated this test with pe --filterAll and the results were revealing, at least for this case. I also patched in a thread-local hash map of atomic counters that I could update from code paths in SQM, StoreScanner, HFileReader*, and HFileBlock. Because an RPC is processed by a single handler thread, I could update counters and accumulate micro-timings via System#nanoTime() per RPC and dump them out of CallRunner in some new trace logging. I spent a couple of days making sure the instrumentation was placed equivalently in both the 1.6 and 2.2 code bases and was producing consistent results. I can provide these patches upon request.
> >>>>
> >>>> Again, the test tables have one family and 1, 5, 10, 20, 50, and 100 distinct column qualifiers per row. After loading each table I made a snapshot and cloned the snapshot for testing, for both 1.6 and 2.2, so both versions were tested using the exact same data files on HDFS. I also used the 1.6 version of PE for both, so the only change is on the server (1.6 vs 2.2 masters and regionservers).
> >>>>
> >>>> It appears a refactor of ScanQueryMatcher and friends has disabled the ability of filters to provide SKIP hints, which prevents us from bypassing version checking (so some extra cost in SQM), and appears to disable an optimization that avoids reseeking, leading to a serious and proportional regression in reseek activity and time spent in that code path. So for queries that use filters, there can be a substantial regression.
> >>>>
> >>>> Other test cases that did not use filters did not show a regression.
> >>>>
> >>>> A test case where I used ROW_INDEX_V1 encoding showed an expected, modest, proportional regression in seeking time, due to the fact that it is optimized for point queries and not for the full table scan case.
> >>>>
> >>>> I will come back here when I understand this better.
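For anyone who wants to reproduce this kind of instrumentation before picking up the patches on HBASE-24637, the idea is roughly the sketch below: a per-handler-thread map of counters, bumped from the scan hot paths and dumped when the call completes. The class and method names here are invented for illustration; the actual patch is the one attached to the JIRA.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.LongAdder;

    /**
     * Sketch of per-RPC scan instrumentation. Each RPC is serviced by a single
     * handler thread, so a ThreadLocal map of counters can be updated from hot
     * code paths (SQM, StoreScanner, HFileReader*, HFileBlock) without
     * synchronization, then dumped and cleared when the call completes.
     */
    public final class ScanCallMetrics {

      private static final ThreadLocal<Map<String, LongAdder>> COUNTERS =
          ThreadLocal.withInitial(HashMap::new);

      private ScanCallMetrics() {
      }

      /** Increment a named counter for the current call. */
      public static void count(String name) {
        COUNTERS.get().computeIfAbsent(name, k -> new LongAdder()).increment();
      }

      /** Accumulate elapsed nanoseconds, measured from a System.nanoTime() start. */
      public static void addNanos(String name, long startNanos) {
        COUNTERS.get().computeIfAbsent(name, k -> new LongAdder())
            .add(System.nanoTime() - startNanos);
      }

      /** Called at the end of the call (e.g. from CallRunner) to log and reset. */
      public static Map<String, LongAdder> snapshotAndReset() {
        Map<String, LongAdder> snapshot = COUNTERS.get();
        COUNTERS.remove();
        return snapshot;
      }
    }

A call site in a reseek path would then look like: long t = System.nanoTime(); /* ... reseek ... */ ScanCallMetrics.addNanos("store_reseek_ns", t); with the accumulated map written to trace logging at the end of the call.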
> >>>>
> >>>> Here are the results for the pe --filterAll case. Each metric is listed per test table (c1 through c100 column qualifiers per row) as 1.6.0 -> 2.2.5, with 2.2.5 shown as a percentage of 1.6.0; (!) marks a large increase in 2.2.5.
> >>>>
> >>>> Counts:
> >>>>
> >>>> rpcs: c1 1 -> 2 (200%) | c5 2 -> 6 (300%) | c10 2 -> 10 (500%) | c20 3 -> 17 (567%) | c50 4 -> 37 (925%) | c100 8 -> 72 (900%)  (better heartbeating)
> >>>> block_reads: c1 11507 -> 11508 (100%) | c5 57255 -> 57257 (100%) | c10 114471 -> 114474 (100%) | c20 230372 -> 230377 (100%) | c50 578292 -> 578298 (100%) | c100 1157955 -> 1157963 (100%)
> >>>> block_unpacks: c1 11507 -> 11508 (100%) | c5 57255 -> 57257 (100%) | c10 114471 -> 114474 (100%) | c20 230372 -> 230377 (100%) | c50 578292 -> 578298 (100%) | c100 1157955 -> 1157963 (100%)
> >>>> seeker_next: c1 10000000 -> 10000000 (100%) | c5 50000000 -> 50000000 (100%) | c10 100000000 -> 100000000 (100%) | c20 200000000 -> 200000000 (100%) | c50 500000000 -> 500000000 (100%) | c100 1000000000 -> 1000000000 (100%)
> >>>> store_next: c1 10000000 -> 9988268 (100%) | c5 50000000 -> 49940082 (100%) | c10 100000000 -> 99879401 (100%) | c20 200000000 -> 199766539 (100%) | c50 500000000 -> 499414653 (100%) | c100 1000000000 -> 998836518 (100%)
> >>>> store_reseek: c1 1 -> 11733 (!) | c5 2 -> 59924 (!) | c10 8 -> 120607 (!) | c20 6 -> 233467 (!) | c50 10 -> 585357 (!) | c100 8 -> 1163490 (!)
> >>>> cells_matched: c1 20000000 -> 20000000 (100%) | c5 60000000 -> 60000000 (100%) | c10 110000000 -> 110000000 (100%) | c20 210000000 -> 210000000 (100%) | c50 510000000 -> 510000000 (100%) | c100 1010000000 -> 1010000000 (100%)
> >>>> column_hint_include: c1 10000000 -> 10000000 (100%) | c5 50000000 -> 50000000 (100%) | c10 100000000 -> 100000000 (100%) | c20 200000000 -> 200000000 (100%) | c50 500000000 -> 500000000 (100%) | c100 1000000000 -> 1000000000 (100%)
> >>>> filter_hint_skip: c1 10000000 -> 10000000 (100%) | c5 50000000 -> 50000000 (100%) | c10 100000000 -> 100000000 (100%) | c20 200000000 -> 200000000 (100%) | c50 500000000 -> 500000000 (100%) | c100 1000000000 -> 1000000000 (100%)
> >>>> sqm_hint_done: c1 9999999 -> 9999999 (100%) | c5 9999998 -> 9999998 (100%) | c10 9999998 -> 9999998 (100%) | c20 9999997 -> 9999997 (100%) | c50 9999996 -> 9999996 (100%) | c100 9999992 -> 9999992 (100%)
> >>>> sqm_hint_seek_next_col: c1 1 -> 0 (0%) | c5 2 -> 40000002 (!) | c10 2 -> 90000002 (!) | c20 3 -> 190000003 (!) | c50 4 -> 490000004 (!) | c100 8 -> 990000008 (!)
> >>>> sqm_hint_seek_next_row: c1 0 -> 10000000 (!) | c5 2 -> 10000000 (!) | c10 0 -> 10000000 (!) | c20 0 -> 10000000 (!) | c50 0 -> 10000000 (!) | c100 0 -> 10000000 (!)
> >>>> sqm_hint_skip: c1 10000000 -> 0 (0%) | c5 50000000 -> 0 (0%) | c10 100000000 -> 0 (0%) | c20 200000000 -> 0 (0%) | c50 500000000 -> 0 (0%) | c100 1000000000 -> 0 (0%)
> >>>> versions_hint_include_and_seek_next_col: c1 0 -> 0 (0%) | c5 0 -> 40000000 (!) | c10 0 -> 90000000 (!) | c20 0 -> 190000000 (!) | c50 0 -> 490000000 (!) | c100 0 -> 990000000 (!)
> >>>> versions_hint_include_and_seek_next_row: c1 0 -> 0 (0%) | c5 0 -> 10000000 (!) | c10 0 -> 10000000 (!) | c20 0 -> 10000000 (!) | c50 0 -> 10000000 (!) | c100 0 -> 10000000 (!)
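To read the hint counters above: filter_hint_skip is presumably the SKIP ReturnCode coming back from the filter itself, and the sqm_* rows record what ScanQueryMatcher ultimately does with it. A "filter all" filter is simply a filter that answers SKIP for every cell, roughly the sketch below. This is illustrative only, not PE's internal filter; the class name is invented and the protobuf serialization a deployable filter needs is omitted.

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.filter.FilterBase;

    /**
     * Sketch of a "filter all" filter: every cell gets a SKIP hint, so nothing
     * is returned to the client and the server does a pure scan of the data.
     */
    public class SkipEverythingFilter extends FilterBase {

      @Override
      public ReturnCode filterCell(Cell cell) {
        // On branch-1 the equivalent override is filterKeyValue(Cell).
        return ReturnCode.SKIP;
      }
    }

In the counts above, 1.6.0 turns those SKIPs into sqm_hint_skip, while 2.2.5 appears to turn them into seek-next-col and seek-next-row hints instead, which lines up with the jump in store_reseek.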
> >>>>
> >>>> Times (ms), same layout:
> >>>>
> >>>> block_read_ms: c1 174.12 -> 215.96 (124%) | c5 683.78 -> 883.02 (129%) | c10 1724.66 -> 1937.87 (112%) | c20 3199.93 -> 3470.92 (108%) | c50 7582.96 -> 7941.45 (105%) | c100 14701.52 -> 16022.57 (109%)
> >>>> block_unpack_ms: c1 0.63 -> 0.62 (99%) | c5 2.35 -> 2.35 (100%) | c10 4.36 -> 4.56 (105%) | c20 9.10 -> 8.30 (91%) | c50 21.59 -> 22.01 (102%) | c100 40.97 -> 40.93 (100%)
> >>>> seeker_next_ms: c1 1073.45 -> 1053.51 (98%) | c5 4452.52 -> 4728.96 (106%) | c10 9359.61 -> 9706.06 (104%) | c20 18879.28 -> 18330.93 (97%) | c50 47664.95 -> 46404.26 (97%) | c100 96511.83 -> 92121.30 (95%)
> >>>> store_next_ms: c1 2651.11 -> 2423.27 (91%) | c5 12048.64 -> 11398.63 (95%) | c10 24833.68 -> 23586.20 (95%) | c20 48493.26 -> 46574.38 (96%) | c50 120786.00 -> 115603.00 (96%) | c100 240627.00 -> 227945.00 (95%)
> >>>> store_reseek_ms: c1 2.71 -> 62.17 (2297%) | c5 10.93 -> 233.67 (2139%) | c10 10.88 -> 401.88 (3693%) | c20 13.98 -> 783.18 (5601%) | c50 22.26 -> 1939.28 (8714%) | c100 24.32 -> 4065.25 (16714%)
> >>>>
> >>>> Test parameters: Rows=10000000, ValueSize=20, BlockEncoding=FAST_DIFF, Compress=NONE, Filter=ALL
> >>>>
> >>>> On Thu, Jun 11, 2020 at 5:08 PM Andrew Purtell <apurt...@apache.org> wrote:
> >>>>> I used PE to generate 10M row tables with one family with either 1, 10, 20, 50, or 100 values per row (unique column qualifiers). An increase in wall clock time was noticeable, for example:
> >>>>>
> >>>>> 1.6.0
> >>>>>
> >>>>> time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20 --nomapred scan 2
> >>>>> real 1m20.179s
> >>>>> user 0m45.470s
> >>>>> sys 0m16.572s
> >>>>>
> >>>>> 2.2.5
> >>>>>
> >>>>> time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20 --nomapred scan 2
> >>>>> real 1m31.016s
> >>>>> user 0m48.658s
> >>>>> sys 0m16.965s
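For context on what that PE scan test measures: each client thread does a plain scan over its share of the table through the ordinary client API, roughly the sketch below. This is illustrative only, not PE's code; the class name is invented, and the table name and caching value are just stand-ins.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;

    /** Rough equivalent of one scan worker: a timed scan that just drains results. */
    public class ScanTimer {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("TestTable_f1_c20"))) {
          Scan scan = new Scan();
          scan.setCaching(1000); // client-side scanner caching; value here is arbitrary
          long rows = 0;
          long start = System.nanoTime();
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
              rows++; // drain results; the cell contents are ignored
            }
          }
          System.out.println(rows + " rows in "
              + (System.nanoTime() - start) / 1_000_000L + " ms");
        }
      }
    }

Since the same 1.6 PE client and the same data files were used against both versions, the wall clock difference mostly reflects the server-side scan path.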
> >>>>> It didn't really make a difference whether I used 1 thread or 4 or 10; the delta was about the same, proportionally. I picked two threads in the end so I'd have enough time to launch async-profiler twice in another shell, to capture flamegraph and call tree output, respectively. async-profiler captured 10 seconds at steady state per test case. Upon first inspection, what jumps out is an increasing proportion of CPU time spent in GC in 2.2.5 vs 1.6.0. The difference increases as the number of columns increases. There is little apparent difference at 1 column, but a 2x or more difference at 20 columns, and a 10x or more difference at 100 columns, eyeballing the charts and flipping back and forth between browser windows. This seems more than coincidental, but obviously calls for capture and analysis of a GC trace with JFR. Will do that next.
> >>>>>
> >>>>> JVM: openjdk version "1.8.0_232" OpenJDK Runtime Environment (Zulu 8.42.0.21-CA-macosx) (build 1.8.0_232-b18) OpenJDK 64-Bit Server VM (Zulu 8.42.0.21-CA-macosx) (build 25.232-b18, mixed mode)
> >>>>>
> >>>>> Regionserver JVM flags: -Xms10g -Xmx10g -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseNUMA -XX:-UseBiasedLocking -XX:+ParallelRefProcEnabled
> >>>>>
> >>>>> On Thu, Jun 11, 2020 at 7:06 AM Jan Van Besien <ja...@ngdata.com> wrote:
> >>>>>> This is promising, thanks a lot. Testing with HBase 2.2.5 shows an improvement, but we're not there yet.
> >>>>>>
> >>>>>> As reported earlier, HBase 2.1.0 was about 60% slower than HBase 1.2.0 in a test that simply scans all the regions in parallel without any filter. A test with HBase 2.2.5 shows it to be about 40% slower than 1.2.0. So that is better than 2.1.0, but still substantially slower than HBase 1.2.0 was.
> >>>>>>
> >>>>>> As before, I tested this both on a 3-node cluster and with a unit test using HBaseTestingUtility. Both tests show very similar relative differences.
> >>>>>>
> >>>>>> Jan
> >>>>>>
> >>>>>> On Thu, Jun 11, 2020 at 2:16 PM Anoop John <anoop.hb...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> In another mail thread Zheng Hu brought up an important JIRA fix: https://issues.apache.org/jira/browse/HBASE-21657
> >>>>>>> Can you please check with this once?
> >>>>>>>
> >>>>>>> Anoop
> >>>>>>>
> >>>>>>> On Tue, Jun 9, 2020 at 8:08 PM Jan Van Besien <ja...@ngdata.com> wrote:
> >>>>>>>>
> >>>>>>>> On Sun, Jun 7, 2020 at 7:49 AM Anoop John <anoop.hb...@gmail.com> wrote:
> >>>>>>>>> As per the above configs, it looks like Bucket Cache is not being used. Only the on-heap LRU cache is in use.
> >>>>>>>>
> >>>>>>>> True (but it is large enough to hold everything, so I don't think it matters).
> >>>>>>>>
> >>>>>>>>> @Jan - Is it possible for you to test with an off-heap Bucket Cache? Configure the bucket cache in off-heap mode with size ~7.5 GB.
> >>>>>>>>
> >>>>>>>> I did a quick test but it seems not to make a measurable difference. If anything, it is actually slightly slower. I see a 100% hit ratio in the L1 LruBlockCache and effectively also 100% in the L2 BucketCache (the hit ratio is not yet at 100%, but hits increase with every test and misses do not).
> >>>>>>>>
> >>>>>>>> Given that the LruBlockCache was already large enough to cache all the data anyway, I did not expect this to help either, to be honest.
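For anyone repeating the off-heap bucket cache test under HBaseTestingUtility, the configuration amounts to roughly the sketch below. The class name is invented, 7680 MB is just the ~7.5 GB suggestion above, and on a real cluster the same two properties go into hbase-site.xml, with -XX:MaxDirectMemorySize raised to cover the cache.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseTestingUtility;

    /** Sketch: mini cluster with an off-heap BucketCache of roughly 7.5 GB. */
    public class OffheapCacheTestCluster {
      public static void main(String[] args) throws Exception {
        HBaseTestingUtility util = new HBaseTestingUtility();
        Configuration conf = util.getConfiguration();
        // L2 block cache backed by direct (off-heap) memory.
        conf.set("hbase.bucketcache.ioengine", "offheap");
        // Values >= 1 are read as megabytes; 7680 MB is about 7.5 GB.
        conf.setInt("hbase.bucketcache.size", 7680);
        util.startMiniCluster();
        try {
          // ... run the parallel region scan test against util.getConnection() ...
        } finally {
          util.shutdownMiniCluster();
        }
      }
    }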
> >>>>>>>>> Do you have any DataBlockEncoding enabled on the CF?
> >>>>>>>>
> >>>>>>>> Yes, FAST_DIFF. But this is of course true in both the tests, with hbase2 and with hbase1.
> >>>>>>>>
> >>>>>>>> Jan

--
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's decrepit hands
   - A23, Crosstalk