Wow. Great stuff, Andrew!

Thank you for compiling and posting it all here. I can only imagine how time-consuming this was.

On 6/26/20 1:57 PM, Andrew Purtell wrote:
Hey Anoop, I opened https://issues.apache.org/jira/browse/HBASE-24637 and
attached the patches and script used to make the comparison.

On Fri, Jun 26, 2020 at 2:33 AM Anoop John <anoop.hb...@gmail.com> wrote:

Great investigation, Andy. Do you know of any Jiras which made changes in SQM?
It would be great if you could attach your patch which tracks the scan flow.
If we have a Jira for this issue, can you please attach it there?

Anoop

On Fri, Jun 26, 2020 at 1:56 AM Andrew Purtell <andrew.purt...@gmail.com>
wrote:

Related, I think I found a bug in branch-1 where we don't heartbeat in the
filter-all case until we switch store files, so scanning a very large store
file might time out with client defaults. Remarking on this here so I don't
forget to follow up.
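
Until that is fixed, one possible client-side mitigation (an illustrative
sketch, not something proposed in this thread) is to raise the scanner
timeout so a long filter-all scan can reach the next store file before the
client gives up; the 300 second value below is only an example:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class LongScanTimeoutExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    // Default is 60000 ms; without server heartbeats a filter-all scan over a
    // large store file may not return anything to the client in that window.
    conf.setInt("hbase.client.scanner.timeout.period", 300000);
    try (Connection connection = ConnectionFactory.createConnection(conf)) {
      // Scanners created from tables obtained via this connection inherit
      // the longer timeout.
    }
  }
}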

On Jun 25, 2020, at 12:27 PM, Andrew Purtell <apurt...@apache.org>
wrote:


I repeated this test with pe --filterAll and the results were revealing, at
least for this case. I also patched in a thread-local hash map of atomic
counters that I could update from code paths in SQM, StoreScanner,
HFileReader*, and HFileBlock. Because an RPC is processed by a single handler
thread, I could update counters and accumulate micro-timings via
System#nanoTime() per RPC and dump them out of CallRunner in some new trace
logging. I spent a couple of days making sure the instrumentation was placed
equivalently in both the 1.6 and 2.2 code bases and was producing consistent
results. I can provide these patches upon request.
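
A minimal sketch of what this kind of per-RPC instrumentation looks like is
below; the class and method names are illustrative only, not the actual
change (the real patches are the ones attached to HBASE-24637):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/**
 * Hypothetical sketch of per-handler-thread scan counters. Because each RPC
 * is served by a single handler thread, a ThreadLocal map can be bumped from
 * hot paths (SQM, StoreScanner, HFileReader*, HFileBlock) without
 * cross-thread synchronization, then dumped and reset when the call
 * completes (e.g. from CallRunner).
 */
public final class ScanMetricsTLS {
  private static final ThreadLocal<Map<String, LongAdder>> COUNTERS =
      ThreadLocal.withInitial(ConcurrentHashMap::new);

  private ScanMetricsTLS() {}

  /** Increment a named counter for the current RPC. */
  public static void inc(String name) {
    COUNTERS.get().computeIfAbsent(name, k -> new LongAdder()).increment();
  }

  /** Accumulate elapsed nanoseconds measured with System.nanoTime(). */
  public static void addNanos(String name, long startNanos) {
    COUNTERS.get().computeIfAbsent(name, k -> new LongAdder())
        .add(System.nanoTime() - startNanos);
  }

  /** Snapshot and clear this thread's counters, e.g. at the end of CallRunner#run. */
  public static Map<String, Long> drain() {
    Map<String, LongAdder> counters = COUNTERS.get();
    Map<String, Long> snapshot = new ConcurrentHashMap<>();
    counters.forEach((name, value) -> snapshot.put(name, value.sum()));
    counters.clear();
    return snapshot;
  }
}

A hot path then does something like: long t = System.nanoTime(); ...;
ScanMetricsTLS.addNanos("store_reseek_ns", t); and the drained map is written
to the trace log when the scan RPC completes.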

Again, the test tables have one family and 1, 5, 10, 20, 50, or 100 distinct
column-qualifiers per row. After loading the table I made a snapshot and
cloned the snapshot for testing, for both 1.6 and 2.2, so both versions were
tested using the exact same data files on HDFS. I also used the 1.6 version
of PE for both, so the only change is on the server (1.6 vs 2.2 masters and
regionservers).
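
For reference, the snapshot/clone step is roughly the following Admin API
sketch; the table and snapshot names here are placeholders, not necessarily
the exact ones used:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class SnapshotCloneSketch {
  public static void main(String[] args) throws Exception {
    try (Connection connection =
             ConnectionFactory.createConnection(HBaseConfiguration.create());
         Admin admin = connection.getAdmin()) {
      TableName source = TableName.valueOf("TestTable_f1_c20"); // loaded once with PE
      // Snapshot the loaded table, then clone the snapshot. The clone
      // references the same HFiles on HDFS, which is how both server
      // versions end up scanning identical data files.
      admin.snapshot("TestTable_f1_c20_snap", source);
      admin.cloneSnapshot("TestTable_f1_c20_snap",
          TableName.valueOf("TestTable_f1_c20_clone"));
    }
  }
}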

It appears a refactor to ScanQueryMatcher and friends has disabled the
ability of filters to provide SKIP hints, which prevents us from bypassing
version checking (so some extra cost in SQM), and appears to disable an
optimization that avoids reseeking, leading to a serious and proportional
regression in reseek activity and time spent in that code path. So for
queries that use filters, there can be a substantial regression.
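
To make the shape of the problem concrete, the sketch below is a minimal
stand-in for the kind of filter pe --filterAll installs (written against the
branch-1 Filter API; this is an illustration, not the actual FilterAllFilter).
Every cell gets a SKIP return code from the filter; the counters below show
1.6 honoring that as a cheap in-scanner skip (sqm_hint_skip) while 2.2 turns
the same workload into column/row seeks (sqm_hint_seek_next_col and
store_reseek).

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.filter.FilterBase;

/**
 * Minimal illustration only (not the real PE FilterAllFilter): a filter that
 * asks the scan to SKIP every cell. Serialization plumbing (toByteArray /
 * parseFrom) is omitted from this sketch.
 */
public class SkipEverythingFilter extends FilterBase {
  @Override
  public ReturnCode filterKeyValue(Cell cell) {
    // Hint: drop this cell and move to the next one within the scanner.
    return ReturnCode.SKIP;
  }

  @Override
  public boolean filterRow() {
    // Exclude every row, so nothing is returned to the client.
    return true;
  }
}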

Other test cases that did not use filters did not show a regression.

A test case where I used ROW_INDEX_V1 encoding showed an expected modest
proportional regression in seeking time, because that encoding is optimized
for point queries and not for the full table scan case.

I will come back here when I understand this better.

Here are the results for the pe --filterAll case:


Counts (cN = N distinct columns per row; the third column is the 2.2.5 count
as a percentage of the 1.6.0 count; "> !" marks a large increase). The rpcs
counts increase under 2.2.5 because of better heartbeating.

c1
metric                                    1.6.0         2.2.5         2.2.5/1.6.0
rpcs                                      1             2             200%
block_reads                               11507         11508         100%
block_unpacks                             11507         11508         100%
seeker_next                               10000000      10000000      100%
store_next                                10000000      9988268       100%
store_reseek                              1             11733         > !
cells_matched                             20000000      20000000      100%
column_hint_include                       10000000      10000000      100%
filter_hint_skip                          10000000      10000000      100%
sqm_hint_done                             9999999       9999999       100%
sqm_hint_seek_next_col                    1             0             0%
sqm_hint_seek_next_row                    0             10000000      > !
sqm_hint_skip                             10000000      0             0%
versions_hint_include_and_seek_next_col   0             0             0%
versions_hint_include_and_seek_next_row   0             0             0%

c5
metric                                    1.6.0         2.2.5         2.2.5/1.6.0
rpcs                                      2             6             300%
block_reads                               57255         57257         100%
block_unpacks                             57255         57257         100%
seeker_next                               50000000      50000000      100%
store_next                                50000000      49940082      100%
store_reseek                              2             59924         > !
cells_matched                             60000000      60000000      100%
column_hint_include                       50000000      50000000      100%
filter_hint_skip                          50000000      50000000      100%
sqm_hint_done                             9999998       9999998       100%
sqm_hint_seek_next_col                    2             40000002      > !
sqm_hint_seek_next_row                    2             10000000      > !
sqm_hint_skip                             50000000      0             0%
versions_hint_include_and_seek_next_col   0             40000000      > !
versions_hint_include_and_seek_next_row   0             10000000      > !

c10
metric                                    1.6.0         2.2.5         2.2.5/1.6.0
rpcs                                      2             10            500%
block_reads                               114471        114474        100%
block_unpacks                             114471        114474        100%
seeker_next                               100000000     100000000     100%
store_next                                100000000     99879401      100%
store_reseek                              8             120607        > !
cells_matched                             110000000     110000000     100%
column_hint_include                       100000000     100000000     100%
filter_hint_skip                          100000000     100000000     100%
sqm_hint_done                             9999998       9999998       100%
sqm_hint_seek_next_col                    2             90000002      > !
sqm_hint_seek_next_row                    0             10000000      > !
sqm_hint_skip                             100000000     0             0%
versions_hint_include_and_seek_next_col   0             90000000      > !
versions_hint_include_and_seek_next_row   0             10000000      > !

c20
metric                                    1.6.0         2.2.5         2.2.5/1.6.0
rpcs                                      3             17            567%
block_reads                               230372        230377        100%
block_unpacks                             230372        230377        100%
seeker_next                               200000000     200000000     100%
store_next                                200000000     199766539     100%
store_reseek                              6             233467        > !
cells_matched                             210000000     210000000     100%
column_hint_include                       200000000     200000000     100%
filter_hint_skip                          200000000     200000000     100%
sqm_hint_done                             9999997       9999997       100%
sqm_hint_seek_next_col                    3             190000003     > !
sqm_hint_seek_next_row                    0             10000000      > !
sqm_hint_skip                             200000000     0             0%
versions_hint_include_and_seek_next_col   0             190000000     > !
versions_hint_include_and_seek_next_row   0             10000000      > !

c50
metric                                    1.6.0         2.2.5         2.2.5/1.6.0
rpcs                                      4             37            925%
block_reads                               578292        578298        100%
block_unpacks                             578292        578298        100%
seeker_next                               500000000     500000000     100%
store_next                                500000000     499414653     100%
store_reseek                              10            585357        > !
cells_matched                             510000000     510000000     100%
column_hint_include                       500000000     500000000     100%
filter_hint_skip                          500000000     500000000     100%
sqm_hint_done                             9999996       9999996       100%
sqm_hint_seek_next_col                    4             490000004     > !
sqm_hint_seek_next_row                    0             10000000      > !
sqm_hint_skip                             500000000     0             0%
versions_hint_include_and_seek_next_col   0             490000000     > !
versions_hint_include_and_seek_next_row   0             10000000      > !

c100
metric                                    1.6.0         2.2.5         2.2.5/1.6.0
rpcs                                      8             72            900%
block_reads                               1157955       1157963       100%
block_unpacks                             1157955       1157963       100%
seeker_next                               1000000000    1000000000    100%
store_next                                1000000000    998836518     100%
store_reseek                              8             1163490       > !
cells_matched                             1010000000    1010000000    100%
column_hint_include                       1000000000    1000000000    100%
filter_hint_skip                          1000000000    1000000000    100%
sqm_hint_done                             9999992       9999992       100%
sqm_hint_seek_next_col                    8             990000008     > !
sqm_hint_seek_next_row                    0             10000000      > !
sqm_hint_skip                             1000000000    0             0%
versions_hint_include_and_seek_next_col   0             990000000     > !
versions_hint_include_and_seek_next_row   0             10000000      > !

Times (milliseconds)

c1
metric             1.6.0         2.2.5         2.2.5/1.6.0
block_read_ms      174.12        215.96        124%
block_unpack_ms    0.63          0.62          99%
seeker_next_ms     1073.45       1053.51       98%
store_next_ms      2651.11       2423.27       91%
store_reseek_ms    2.71          62.17         2297%

c5
metric             1.6.0         2.2.5         2.2.5/1.6.0
block_read_ms      683.78        883.02        129%
block_unpack_ms    2.35          2.35          100%
seeker_next_ms     4452.52       4728.96       106%
store_next_ms      12048.64      11398.63      95%
store_reseek_ms    10.93         233.67        2139%

c10
metric             1.6.0         2.2.5         2.2.5/1.6.0
block_read_ms      1724.66       1937.87       112%
block_unpack_ms    4.36          4.56          105%
seeker_next_ms     9359.61       9706.06       104%
store_next_ms      24833.68      23586.20      95%
store_reseek_ms    10.88         401.88        3693%

c20
metric             1.6.0         2.2.5         2.2.5/1.6.0
block_read_ms      3199.93       3470.92       108%
block_unpack_ms    9.10          8.30          91%
seeker_next_ms     18879.28      18330.93      97%
store_next_ms      48493.26      46574.38      96%
store_reseek_ms    13.98         783.18        5601%

c50
metric             1.6.0         2.2.5         2.2.5/1.6.0
block_read_ms      7582.96       7941.45       105%
block_unpack_ms    21.59         22.01         102%
seeker_next_ms     47664.95      46404.26      97%
store_next_ms      120786.00     115603.00     96%
store_reseek_ms    22.26         1939.28       8714%

c100
metric             1.6.0         2.2.5         2.2.5/1.6.0
block_read_ms      14701.52      16022.57      109%
block_unpack_ms    40.97         40.93         100%
seeker_next_ms     96511.83      92121.30      95%
store_next_ms      240627.00     227945.00     95%
store_reseek_ms    24.32         4065.25       16714%

Test parameters:
Rows=10000000
ValueSize=20
BlockEncoding=FAST_DIFF
Compress=NONE
Filter=ALL

On Thu, Jun 11, 2020 at 5:08 PM Andrew Purtell <apurt...@apache.org>
wrote:
I used PE to generate 10M row tables with one family with either 1, 10, 20,
50, or 100 values per row (unique column-qualifiers). An increase in wall
clock time was noticeable, for example:

1.6.0

time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20
--columns=20 --nomapred scan 2
real 1m20.179s
user 0m45.470s
sys 0m16.572s

2.2.5

time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20
--columns=20 --nomapred scan 2
real 1m31.016s
user 0m48.658s
sys 0m16.965s

It didn't really make a difference if I used 1 thread or 4 or 10; the delta
was about the same, proportionally. I picked two threads in the end so I'd
have enough time to launch async-profiler twice in another shell, to capture
flamegraph and call tree output, respectively. async-profiler captured 10
seconds at steady state per test case. Upon first inspection, what jumps out
is an increasing proportion of CPU time spent in GC in 2.2.5 vs 1.6.0. The
difference increases as the number of columns increases. There is little
apparent difference at 1 column, but a 2x or more difference at 20 columns,
and a 10x or more difference at 100 columns, eyeballing the charts and
flipping back and forth between browser windows. This seems more than
coincidental, but obviously calls for capture and analysis of a GC trace
with JFR. Will do that next.

JVM: openjdk version "1.8.0_232" OpenJDK Runtime Environment (Zulu
8.42.0.21-CA-macosx) (build 1.8.0_232-b18) OpenJDK 64-Bit Server VM (Zulu
8.42.0.21-CA-macosx) (build 25.232-b18, mixed mode)

Regionserver JVM flags: -Xms10g -Xmx10g -XX:+UseG1GC
-XX:+AlwaysPreTouch -XX:+UseNUMA -XX:-UseBiasedLocking
-XX:+ParallelRefProcEnabled


On Thu, Jun 11, 2020 at 7:06 AM Jan Van Besien <ja...@ngdata.com>
wrote:
This is promising, thanks a lot. Testing with hbase 2.2.5 shows an
improvement, but we're not there yet.

As reported earlier, hbase 2.1.0 was about 60% slower than hbase 1.2.0 in a
test that simply scans all the regions in parallel without any filter. A test
with hbase 2.2.5 shows it to be about 40% slower than 1.2.0. So that is
better than 2.1.0, but still substantially slower than hbase 1.2.0 was.

As before, I tested this both on a 3-node cluster and with a unit test using
HBaseTestingUtility. Both tests show very similar relative differences.

Jan

On Thu, Jun 11, 2020 at 2:16 PM Anoop John <anoop.hb...@gmail.com>
wrote:

In another mail thread Zheng Hu brought up an important Jira fix:
https://issues.apache.org/jira/browse/HBASE-21657
Can you please check with this fix?

Anoop


On Tue, Jun 9, 2020 at 8:08 PM Jan Van Besien <ja...@ngdata.com>
wrote:

On Sun, Jun 7, 2020 at 7:49 AM Anoop John <anoop.hb...@gmail.com> wrote:
As per the above configs, it looks like Bucket Cache is not being used. Only
the on-heap LRU cache is in use.

True (but it is large enough to hold everything, so I don't think it
matters).

@Jan - Is it possible for you to test with an off-heap Bucket Cache?
Configure the bucket cache in off-heap mode with a size of ~7.5 GB.
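
For reference, an off-heap bucket cache of that size amounts to roughly the
following regionserver settings (a sketch only; these normally go in
hbase-site.xml, shown here as Configuration calls, and the JVM also needs
enough direct memory, e.g. -XX:MaxDirectMemorySize=8g):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OffHeapBucketCacheConfig {
  public static Configuration build() {
    Configuration conf = HBaseConfiguration.create();
    // Back the L2 block cache with direct (off-heap) memory.
    conf.set("hbase.bucketcache.ioengine", "offheap");
    // Cache capacity in megabytes (~7.5 GB).
    conf.setInt("hbase.bucketcache.size", 7680);
    return conf;
  }
}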

I did a quick test but it seems not to make a measurable difference. If
anything, it is actually slightly slower. I see a 100% hit ratio in the L1
LruBlockCache and effectively also 100% in the L2 BucketCache (the hit ratio
is not yet at 100%, but hits increase with every test and misses do not).

Given that the LruBlockCache was already large enough to cache all the data
anyway, I did not expect this to help either, to be honest.

Do you have any DataBlockEncoding enabled on the CF?

Yes, FAST_DIFF. But this is of course true in both the tests with
hbase2 and hbase1.

Jan



--
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
    - A23, Crosstalk

