Re: HBase 2 slower than HBase 1?

Andrew Purtell Fri, 26 Jun 2020 10:58:34 -0700

Hey Anoop, I opened https://issues.apache.org/jira/browse/HBASE-24637 and
attached the patches and script used to make the comparison.


On Fri, Jun 26, 2020 at 2:33 AM Anoop John <[email protected]> wrote:

> Great investigation Andy.  Do you know any Jiras which made changes in SQM?
> Would be great if you can attach your patch which tracks the scan flow.  If
> we have a Jira for this issue, can you pls attach?
>
> Anoop
>
> On Fri, Jun 26, 2020 at 1:56 AM Andrew Purtell <[email protected]>
> wrote:
>
> > Related, I think I found a bug in branch-1 where we don’t heartbeat in
> the
> > filter all case until we switch store files, so scanning a very large
> store
> > file might time out with client defaults. Remarking on this here so I
> don’t
> > forget to follow up.
> >
> > > On Jun 25, 2020, at 12:27 PM, Andrew Purtell <[email protected]>
> > wrote:
> > >
> > > 
> > > I repeated this test with pe --filterAll and the results were
> revealing,
> > at least for this case. I also patched in thread local hash map for
> atomic
> > counters that I could update from code paths in SQM, StoreScanner,
> > HFileReader*, and HFileBlock. Because a RPC is processed by a single
> > handler thread I could update counters and accumulate micro-timings via
> > System#nanoTime() per RPC and dump them out of CallRunner in some new
> trace
> > logging. I spent a couple of days making sure the instrumentation was
> > placed equivalently in both 1.6 and 2.2 code bases and was producing
> > consistent results. I can provide these patches upon request.
> > >
> > > Again, test tables with one family and 1, 5, 10, 20, 50, and 100
> > distinct column-qualifiers per row. After loading the table I made a
> > snapshot and cloned the snapshot for testing, for both 1.6 and 2.2, so
> both
> > versions were tested using the exact same data files on HDFS. I also used
> > the 1.6 version of PE for both, so the only change is on the server (1.6
> vs
> > 2.2 masters and regionservers).
> > >
> > > It appears a refactor to ScanQueryMatcher and friends has disabled the
> > ability of filters to provide SKIP hints, which prevents us from
> bypassing
> > version checking (so some extra cost in SQM), and appears to disable an
> > optimization that avoids reseeking, leading to a serious and proportional
> > regression in reseek activity and time spent in that code path. So for
> > queries that use filters, there can be a substantial regression.
> > >
> > > Other test cases that did not use filters did not show a regression.
> > >
> > > A test case where I used ROW_INDEX_V1 encoding showed an expected
> modest
> > proportional regression in seeking time, due to the fact it is optimized
> > for point queries and not optimized for the full table scan case.
> > >
> > > I will come back here when I understand this better.
> > >
> > > Here are the results for the pe --filterAll case:
> > >
> > >
> > > 1.6.0 c1      2.2.5 c1
> > > 1.6.0 c5      2.2.5 c5
> > > 1.6.0 c10     2.2.5 c10
> > > 1.6.0 c20     2.2.5 c20
> > > 1.6.0 c50     2.2.5 c50
> > > 1.6.0 c100    2.2.5 c100
> > > Counts
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >     (better heartbeating)
> > >     (better heartbeating)
> > >     (better heartbeating)
> > >     (better heartbeating)
> > >     (better heartbeating)
> > > rpcs  1       2       200%    2       6       300%    2       10
> > 500%    3       17      567%    4       37      925%    8       72
> 900%
> > > block_reads   11507   11508   100%    57255   57257   100%    114471
> > 114474  100%    230372  230377  100%    578292  578298  100%    1157955
> > 1157963 100%
> > > block_unpacks 11507   11508   100%    57255   57257   100%    114471
> > 114474  100%    230372  230377  100%    578292  578298  100%    1157955
> > 1157963 100%
> > > seeker_next   10000000        10000000        100%    50000000
> > 50000000        100%    100000000       100000000       100%    200000000
> >      200000000       100%    500000000       500000000       100%
> > 1000000000      1000000000      100%
> > > store_next    10000000        9988268 100%    50000000        49940082
> >       100%    100000000       99879401        100%    200000000
> >  199766539       100%    500000000       499414653       100%
> > 1000000000      998836518       100%
> > > store_reseek  1       11733   > !     2       59924   > !     8
> >  120607  > !     6       233467  > !     10      585357  > !     8
> >  1163490 > !
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > cells_matched 20000000        20000000        100%    60000000
> > 60000000        100%    110000000       110000000       100%    210000000
> >      210000000       100%    510000000       510000000       100%
> > 1010000000      1010000000      100%
> > > column_hint_include   10000000        10000000        100%    50000000
> >       50000000        100%    100000000       100000000       100%
> > 200000000       200000000       100%    500000000       500000000
> >  100%    1000000000      1000000000      100%
> > > filter_hint_skip      10000000        10000000        100%    50000000
> >       50000000        100%    100000000       100000000       100%
> > 200000000       200000000       100%    500000000       500000000
> >  100%    1000000000      1000000000      100%
> > > sqm_hint_done 9999999 9999999 100%    9999998 9999998 100%    9999998
> > 9999998 100%    9999997 9999997 100%    9999996 9999996 100%    9999992
> > 9999992 100%
> > > sqm_hint_seek_next_col        1       0       0%      2       40000002
> >       > !     2       90000002        > !     3       190000003       > !
> >    4       490000004       > !     8       990000008       > !
> > > sqm_hint_seek_next_row        0       10000000        > !     2
> >  10000000        > !     0       10000000        > !     0       10000000
> >       > !     0       10000000        > !     0       10000000        > !
> > > sqm_hint_skip 10000000        0       0%      50000000        0
> >  0%      100000000       0       0%      200000000       0       0%
> > 500000000       0       0%      1000000000      0       0%
> > > versions_hint_include_and_seek_next_col       0       0       0%      0
> >      40000000        > !     0       90000000        > !     0
> >  190000000       > !     0       490000000       > !     0
>  990000000
> >      > !
> > > versions_hint_include_and_seek_next_row       0       0       0%      0
> >      10000000        > !     0       10000000        > !     0
> >  10000000        > !     0       10000000        > !     0       10000000
> >       > !
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Times
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > block_read_ms 174.12  215.96  124%    683.78  883.02  129%    1724.66
> > 1937.87 112%    3199.93 3470.92 108%    7582.96 7941.45 105%    14701.52
> >     16022.57        109%
> > > block_unpack_ms       0.63    0.62    99%     2.35    2.35    100%
> > 4.36    4.56    105%    9.10    8.30    91%     21.59   22.01   102%
> > 40.97   40.93   100%
> > > seeker_next_ms        1073.45 1053.51 98%     4452.52 4728.96 106%
> > 9359.61 9706.06 104%    18879.28        18330.93        97%     47664.95
> >     46404.26        97%     96511.83        92121.30        95%
> > > store_next_ms 2651.11 2423.27 91%     12048.64        11398.63
> > 95%     24833.68        23586.20        95%     48493.26        46574.38
> >     96%     120786.00       115603.00       96%     240627.00
> >  227945.00       95%
> > > store_reseek_ms       2.71    62.17   2297%   10.93   233.67  2139%
> >  10.88   401.88  3693%   13.98   783.18  5601%   22.26   1939.28 8714%
> >  24.32   4065.25 16714%
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Rows=10000000
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > ValueSize=20
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > BlockEncoding=FAST_DIFF
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Compress=NONE
> > > Filter=ALL
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Thu, Jun 11, 2020 at 5:08 PM Andrew Purtell <[email protected]>
> > wrote:
> > >> I used PE to generate 10M row tables with one family with either 1,
> 10,
> > 20, 50, or 100 values per row (unique column-qualifiers). An increase in
> > wall clock time was noticeable, for example:
> > >>
> > >> 1.6.0
> > >>
> > >> time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20
> > --columns=20 --nomapred scan 2
> > >> real 1m20.179s
> > >> user 0m45.470s
> > >> sys 0m16.572s
> > >>
> > >> 2.2.5
> > >>
> > >> time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20
> > --columns=20 --nomapred scan 2
> > >> real 1m31.016s
> > >> user 0m48.658s
> > >> sys 0m16.965s
> > >>
> > >> It didnt really make a difference if I used 1 thread or 4 or 10, the
> > delta was about the same, proportionally. I picked two threads in the end
> > so I'd have enough time to launch async-profiler twice in another shell,
> to
> > capture flamegraph and call tree output, respectively. async-profiler
> > captured 10 seconds at steady state per test case. Upon first inspection
> > what jumps out is an increasing proportion of CPU time spent in GC in
> 2.2.5
> > vs 1.6.0. The difference increases as the number of column families
> > increase. There is little apparent difference at 1 column family, but a
> 2x
> > or more difference at 20 columns, and a 10x or more difference at 100
> > columns, eyeballing the charts, flipping back and forth between browser
> > windows. This seems more than coincidental but obviously calls for
> capture
> > and analysis of GC trace, with JFR. Will do that next.
> > >>
> > >> JVM: openjdk version "1.8.0_232" OpenJDK Runtime Environment (Zulu
> > 8.42.0.21-CA-macosx) (build 1.8.0_232-b18) OpenJDK 64-Bit Server VM (Zulu
> > 8.42.0.21-CA-macosx) (build 25.232-b18, mixed mode)
> > >>
> > >> Regionserver JVM flags: -Xms10g -Xmx10g -XX:+UseG1GC
> > -XX:+AlwaysPreTouch -XX:+UseNUMA -XX:-UseBiasedLocking
> > -XX:+ParallelRefProcEnabled
> > >>
> > >>
> > >> On Thu, Jun 11, 2020 at 7:06 AM Jan Van Besien <[email protected]>
> > wrote:
> > >>> This is promising, thanks a lot. Testing with hbase 2.2.5 shows an
> > >>> improvement, but we're not there yet.
> > >>>
> > >>> As reported earlier, hbase 2.1.0 was about 60% slower than hbase
> 1.2.0
> > >>> in a test that simply scans all the regions in parallel without any
> > >>> filter. A test with hbase 2.2.5 shows it to be about 40% slower than
> > >>> 1.2.0. So that is better than 2.1.0, but still substantially slower
> > >>> than what hbase 1.2.0 was.
> > >>>
> > >>> As before, I tested this both on a 3 node cluster as well as with a
> > >>> unit test using HBaseTestingUtility. Both tests show very similar
> > >>> relative differences.
> > >>>
> > >>> Jan
> > >>>
> > >>> On Thu, Jun 11, 2020 at 2:16 PM Anoop John <[email protected]>
> > wrote:
> > >>> >
> > >>> > In another mail thread Zheng Hu brought up an important Jra fix
> > >>> > https://issues.apache.org/jira/browse/HBASE-21657
> > >>> > Can u pls check with this once?
> > >>> >
> > >>> > Anoop
> > >>> >
> > >>> >
> > >>> > On Tue, Jun 9, 2020 at 8:08 PM Jan Van Besien <[email protected]>
> > wrote:
> > >>> >
> > >>> > > On Sun, Jun 7, 2020 at 7:49 AM Anoop John <[email protected]
> >
> > wrote:
> > >>> > > > As per the above configs, it looks like Bucket Cache is not
> > being used.
> > >>> > > > Only on heap LRU cache in use.
> > >>> > >
> > >>> > > True (but it is large enough to hold everything, so I don't think
> > it
> > >>> > > matters).
> > >>> > >
> > >>> > > > @Jan - Is it possible for you to test with off heap Bucket
> Cache?
> > >>> > >  Config
> > >>> > > > bucket cache off heap mode with size ~7.5 GB
> > >>> > >
> > >>> > > I did a quick test but it seems not to make a measurable
> > difference.
> > >>> > > If anything, it is actually slightly slower even. I see 100% hit
> > ratio
> > >>> > > in the L1
> > >>> > > LruBlockCache and effectively also 100% in the L2 BucketCache
> (hit
> > >>> > > ratio is not yet at 100% but hits increase with every test and
> > misses
> > >>> > > do not).
> > >>> > >
> > >>> > > Given that the LruBlockCache was already large enough to cache
> all
> > the
> > >>> > > data anyway, I did not expect this to help either, to be honest.
> > >>> > >
> > >>> > > > Do you have any DataBlockEncoding enabled on the CF?
> > >>> > >
> > >>> > > Yes, FAST_DIFF. But this is of course true in both the tests with
> > >>> > > hbase2 and hbase1.
> > >>> > >
> > >>> > > Jan
> > >>> > >
> > >>
> > >>
> > >> --
> > >> Best regards,
> > >> Andrew
> > >>
> > >> Words like orphans lost among the crosstalk, meaning torn from truth's
> > decrepit hands
> > >>    - A23, Crosstalk
> > >
> > >
> > > --
> > > Best regards,
> > > Andrew
> > >
> > > Words like orphans lost among the crosstalk, meaning torn from truth's
> > decrepit hands
> > >    - A23, Crosstalk
> >
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk

Re: HBase 2 slower than HBase 1?

Reply via email to