I went out on vacation (and am still out) before tracking this down. If you are waiting for me to make more progress with HBASE-24637, I can do that in a couple of weeks. Anyone is welcome to step in sooner.
On Tue, Jul 14, 2020 at 11:38 AM Josh Elser <els...@apache.org> wrote:
> Wow. Great stuff, Andrew!
>
> Thank you for compiling and posting it all here. I can only imagine how time-consuming this was.
>
> On 6/26/20 1:57 PM, Andrew Purtell wrote:
> > Hey Anoop, I opened https://issues.apache.org/jira/browse/HBASE-24637 and attached the patches and script used to make the comparison.
> >
> > On Fri, Jun 26, 2020 at 2:33 AM Anoop John <anoop.hb...@gmail.com> wrote:
> >> Great investigation, Andy. Do you know of any JIRAs which made changes in SQM? It would be great if you could attach your patch which tracks the scan flow. If we have a JIRA for this issue, can you please attach it there?
> >>
> >> Anoop
> >>
> >> On Fri, Jun 26, 2020 at 1:56 AM Andrew Purtell <andrew.purt...@gmail.com> wrote:
> >>> Related, I think I found a bug in branch-1 where we don't heartbeat in the filter-all case until we switch store files, so scanning a very large store file might time out with client defaults. Remarking on this here so I don't forget to follow up.
> >>>
> >>>> On Jun 25, 2020, at 12:27 PM, Andrew Purtell <apurt...@apache.org> wrote:
> >>>>
> >>>> I repeated this test with pe --filterAll and the results were revealing, at least for this case. I also patched in a thread-local hash map of atomic counters that I could update from code paths in SQM, StoreScanner, HFileReader*, and HFileBlock. Because an RPC is processed by a single handler thread, I could update counters and accumulate micro-timings via System#nanoTime() per RPC and dump them out of CallRunner in some new trace logging. I spent a couple of days making sure the instrumentation was placed equivalently in both the 1.6 and 2.2 code bases and was producing consistent results. I can provide these patches upon request.
> >>>>
> >>>> Again, the test tables have one family and 1, 5, 10, 20, 50, and 100 distinct column qualifiers per row. After loading each table I made a snapshot and cloned the snapshot for testing, for both 1.6 and 2.2, so both versions were tested using the exact same data files on HDFS. I also used the 1.6 version of PE for both, so the only change is on the server (1.6 vs 2.2 masters and regionservers).
> >>>>
> >>>> It appears a refactor of ScanQueryMatcher and friends has disabled the ability of filters to provide SKIP hints, which prevents us from bypassing version checking (so some extra cost in SQM), and appears to disable an optimization that avoids reseeking, leading to a serious and proportional regression in reseek activity and time spent in that code path. So for queries that use filters, there can be a substantial regression.
> >>>>
> >>>> Other test cases that did not use filters did not show a regression.
> >>>>
> >>>> A test case where I used ROW_INDEX_V1 encoding showed an expected, modest, proportional regression in seeking time, due to the fact that it is optimized for point queries and not for the full table scan case.
> >>>>
> >>>> I will come back here when I understand this better.
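For anyone who wants to reproduce this kind of instrumentation before picking up the patches on HBASE-24637, the idea is roughly the sketch below: a per-handler-thread map of counters, bumped from the scan hot paths and dumped when the call completes. The class and method names here are invented for illustration; the actual patch is the one attached to the JIRA.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.atomic.LongAdder;

    /**
     * Sketch of per-RPC scan instrumentation. Each RPC is serviced by a single
     * handler thread, so a ThreadLocal map of counters can be updated from hot
     * code paths (SQM, StoreScanner, HFileReader*, HFileBlock) without
     * synchronization, then dumped and cleared when the call completes.
     */
    public final class ScanCallMetrics {

      private static final ThreadLocal<Map<String, LongAdder>> COUNTERS =
          ThreadLocal.withInitial(HashMap::new);

      private ScanCallMetrics() {
      }

      /** Increment a named counter for the current call. */
      public static void count(String name) {
        COUNTERS.get().computeIfAbsent(name, k -> new LongAdder()).increment();
      }

      /** Accumulate elapsed nanoseconds, measured from a System.nanoTime() start. */
      public static void addNanos(String name, long startNanos) {
        COUNTERS.get().computeIfAbsent(name, k -> new LongAdder())
            .add(System.nanoTime() - startNanos);
      }

      /** Called at the end of the call (e.g. from CallRunner) to log and reset. */
      public static Map<String, LongAdder> snapshotAndReset() {
        Map<String, LongAdder> snapshot = COUNTERS.get();
        COUNTERS.remove();
        return snapshot;
      }
    }

A call site in a reseek path would then look like: long t = System.nanoTime(); /* ... reseek ... */ ScanCallMetrics.addNanos("store_reseek_ns", t); with the accumulated map written to trace logging at the end of the call.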
> >>>>
> >>>> Here are the results for the pe --filterAll case. Each metric is listed per test table (c1 through c100 column qualifiers per row) as 1.6.0 -> 2.2.5, with 2.2.5 shown as a percentage of 1.6.0; (!) marks a large increase in 2.2.5.
> >>>>
> >>>> Counts:
> >>>>
> >>>> rpcs: c1 1 -> 2 (200%) | c5 2 -> 6 (300%) | c10 2 -> 10 (500%) | c20 3 -> 17 (567%) | c50 4 -> 37 (925%) | c100 8 -> 72 (900%)  (better heartbeating)
> >>>> block_reads: c1 11507 -> 11508 (100%) | c5 57255 -> 57257 (100%) | c10 114471 -> 114474 (100%) | c20 230372 -> 230377 (100%) | c50 578292 -> 578298 (100%) | c100 1157955 -> 1157963 (100%)
> >>>> block_unpacks: c1 11507 -> 11508 (100%) | c5 57255 -> 57257 (100%) | c10 114471 -> 114474 (100%) | c20 230372 -> 230377 (100%) | c50 578292 -> 578298 (100%) | c100 1157955 -> 1157963 (100%)
> >>>> seeker_next: c1 10000000 -> 10000000 (100%) | c5 50000000 -> 50000000 (100%) | c10 100000000 -> 100000000 (100%) | c20 200000000 -> 200000000 (100%) | c50 500000000 -> 500000000 (100%) | c100 1000000000 -> 1000000000 (100%)
> >>>> store_next: c1 10000000 -> 9988268 (100%) | c5 50000000 -> 49940082 (100%) | c10 100000000 -> 99879401 (100%) | c20 200000000 -> 199766539 (100%) | c50 500000000 -> 499414653 (100%) | c100 1000000000 -> 998836518 (100%)
> >>>> store_reseek: c1 1 -> 11733 (!) | c5 2 -> 59924 (!) | c10 8 -> 120607 (!) | c20 6 -> 233467 (!) | c50 10 -> 585357 (!) | c100 8 -> 1163490 (!)
> >>>> cells_matched: c1 20000000 -> 20000000 (100%) | c5 60000000 -> 60000000 (100%) | c10 110000000 -> 110000000 (100%) | c20 210000000 -> 210000000 (100%) | c50 510000000 -> 510000000 (100%) | c100 1010000000 -> 1010000000 (100%)
> >>>> column_hint_include: c1 10000000 -> 10000000 (100%) | c5 50000000 -> 50000000 (100%) | c10 100000000 -> 100000000 (100%) | c20 200000000 -> 200000000 (100%) | c50 500000000 -> 500000000 (100%) | c100 1000000000 -> 1000000000 (100%)
> >>>> filter_hint_skip: c1 10000000 -> 10000000 (100%) | c5 50000000 -> 50000000 (100%) | c10 100000000 -> 100000000 (100%) | c20 200000000 -> 200000000 (100%) | c50 500000000 -> 500000000 (100%) | c100 1000000000 -> 1000000000 (100%)
> >>>> sqm_hint_done: c1 9999999 -> 9999999 (100%) | c5 9999998 -> 9999998 (100%) | c10 9999998 -> 9999998 (100%) | c20 9999997 -> 9999997 (100%) | c50 9999996 -> 9999996 (100%) | c100 9999992 -> 9999992 (100%)
> >>>> sqm_hint_seek_next_col: c1 1 -> 0 (0%) | c5 2 -> 40000002 (!) | c10 2 -> 90000002 (!) | c20 3 -> 190000003 (!) | c50 4 -> 490000004 (!) | c100 8 -> 990000008 (!)
> >>>> sqm_hint_seek_next_row: c1 0 -> 10000000 (!) | c5 2 -> 10000000 (!) | c10 0 -> 10000000 (!) | c20 0 -> 10000000 (!) | c50 0 -> 10000000 (!) | c100 0 -> 10000000 (!)
> >>>> sqm_hint_skip: c1 10000000 -> 0 (0%) | c5 50000000 -> 0 (0%) | c10 100000000 -> 0 (0%) | c20 200000000 -> 0 (0%) | c50 500000000 -> 0 (0%) | c100 1000000000 -> 0 (0%)
> >>>> versions_hint_include_and_seek_next_col: c1 0 -> 0 (0%) | c5 0 -> 40000000 (!) | c10 0 -> 90000000 (!) | c20 0 -> 190000000 (!) | c50 0 -> 490000000 (!) | c100 0 -> 990000000 (!)
> >>>> versions_hint_include_and_seek_next_row: c1 0 -> 0 (0%) | c5 0 -> 10000000 (!) | c10 0 -> 10000000 (!) | c20 0 -> 10000000 (!) | c50 0 -> 10000000 (!) | c100 0 -> 10000000 (!)
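To read the hint counters above: filter_hint_skip is presumably the SKIP ReturnCode coming back from the filter itself, and the sqm_* rows record what ScanQueryMatcher ultimately does with it. A "filter all" filter is simply a filter that answers SKIP for every cell, roughly the sketch below. This is illustrative only, not PE's internal filter; the class name is invented and the protobuf serialization a deployable filter needs is omitted.

    import org.apache.hadoop.hbase.Cell;
    import org.apache.hadoop.hbase.filter.FilterBase;

    /**
     * Sketch of a "filter all" filter: every cell gets a SKIP hint, so nothing
     * is returned to the client and the server does a pure scan of the data.
     */
    public class SkipEverythingFilter extends FilterBase {

      @Override
      public ReturnCode filterCell(Cell cell) {
        // On branch-1 the equivalent override is filterKeyValue(Cell).
        return ReturnCode.SKIP;
      }
    }

In the counts above, 1.6.0 turns those SKIPs into sqm_hint_skip, while 2.2.5 appears to turn them into seek-next-col and seek-next-row hints instead, which lines up with the jump in store_reseek.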
> >>>>
> >>>> Times (ms), same layout:
> >>>>
> >>>> block_read_ms: c1 174.12 -> 215.96 (124%) | c5 683.78 -> 883.02 (129%) | c10 1724.66 -> 1937.87 (112%) | c20 3199.93 -> 3470.92 (108%) | c50 7582.96 -> 7941.45 (105%) | c100 14701.52 -> 16022.57 (109%)
> >>>> block_unpack_ms: c1 0.63 -> 0.62 (99%) | c5 2.35 -> 2.35 (100%) | c10 4.36 -> 4.56 (105%) | c20 9.10 -> 8.30 (91%) | c50 21.59 -> 22.01 (102%) | c100 40.97 -> 40.93 (100%)
> >>>> seeker_next_ms: c1 1073.45 -> 1053.51 (98%) | c5 4452.52 -> 4728.96 (106%) | c10 9359.61 -> 9706.06 (104%) | c20 18879.28 -> 18330.93 (97%) | c50 47664.95 -> 46404.26 (97%) | c100 96511.83 -> 92121.30 (95%)
> >>>> store_next_ms: c1 2651.11 -> 2423.27 (91%) | c5 12048.64 -> 11398.63 (95%) | c10 24833.68 -> 23586.20 (95%) | c20 48493.26 -> 46574.38 (96%) | c50 120786.00 -> 115603.00 (96%) | c100 240627.00 -> 227945.00 (95%)
> >>>> store_reseek_ms: c1 2.71 -> 62.17 (2297%) | c5 10.93 -> 233.67 (2139%) | c10 10.88 -> 401.88 (3693%) | c20 13.98 -> 783.18 (5601%) | c50 22.26 -> 1939.28 (8714%) | c100 24.32 -> 4065.25 (16714%)
> >>>>
> >>>> Test parameters: Rows=10000000, ValueSize=20, BlockEncoding=FAST_DIFF, Compress=NONE, Filter=ALL
> >>>>
> >>>> On Thu, Jun 11, 2020 at 5:08 PM Andrew Purtell <apurt...@apache.org> wrote:
> >>>>> I used PE to generate 10M row tables with one family with either 1, 10, 20, 50, or 100 values per row (unique column qualifiers). An increase in wall clock time was noticeable, for example:
> >>>>>
> >>>>> 1.6.0
> >>>>>
> >>>>> time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20 --nomapred scan 2
> >>>>> real 1m20.179s
> >>>>> user 0m45.470s
> >>>>> sys 0m16.572s
> >>>>>
> >>>>> 2.2.5
> >>>>>
> >>>>> time ./bin/hbase pe --rows=5000000 --table=TestTable_f1_c20 --columns=20 --nomapred scan 2
> >>>>> real 1m31.016s
> >>>>> user 0m48.658s
> >>>>> sys 0m16.965s
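For context on what that PE scan test measures: each client thread does a plain scan over its share of the table through the ordinary client API, roughly the sketch below. This is illustrative only, not PE's code; the class name is invented, and the table name and caching value are just stand-ins.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;

    /** Rough equivalent of one scan worker: a timed scan that just drains results. */
    public class ScanTimer {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("TestTable_f1_c20"))) {
          Scan scan = new Scan();
          scan.setCaching(1000); // client-side scanner caching; value here is arbitrary
          long rows = 0;
          long start = System.nanoTime();
          try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result r : scanner) {
              rows++; // drain results; the cell contents are ignored
            }
          }
          System.out.println(rows + " rows in "
              + (System.nanoTime() - start) / 1_000_000L + " ms");
        }
      }
    }

Since the same 1.6 PE client and the same data files were used against both versions, the wall clock difference mostly reflects the server-side scan path.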
> >>>>> It didn't really make a difference whether I used 1 thread or 4 or 10; the delta was about the same, proportionally. I picked two threads in the end so I'd have enough time to launch async-profiler twice in another shell, to capture flamegraph and call tree output, respectively. async-profiler captured 10 seconds at steady state per test case. Upon first inspection, what jumps out is an increasing proportion of CPU time spent in GC in 2.2.5 vs 1.6.0. The difference increases as the number of columns increases. There is little apparent difference at 1 column, but a 2x or more difference at 20 columns, and a 10x or more difference at 100 columns, eyeballing the charts and flipping back and forth between browser windows. This seems more than coincidental, but obviously calls for capture and analysis of a GC trace with JFR. Will do that next.
> >>>>>
> >>>>> JVM: openjdk version "1.8.0_232" OpenJDK Runtime Environment (Zulu 8.42.0.21-CA-macosx) (build 1.8.0_232-b18) OpenJDK 64-Bit Server VM (Zulu 8.42.0.21-CA-macosx) (build 25.232-b18, mixed mode)
> >>>>>
> >>>>> Regionserver JVM flags: -Xms10g -Xmx10g -XX:+UseG1GC -XX:+AlwaysPreTouch -XX:+UseNUMA -XX:-UseBiasedLocking -XX:+ParallelRefProcEnabled
> >>>>>
> >>>>> On Thu, Jun 11, 2020 at 7:06 AM Jan Van Besien <ja...@ngdata.com> wrote:
> >>>>>> This is promising, thanks a lot. Testing with HBase 2.2.5 shows an improvement, but we're not there yet.
> >>>>>>
> >>>>>> As reported earlier, HBase 2.1.0 was about 60% slower than HBase 1.2.0 in a test that simply scans all the regions in parallel without any filter. A test with HBase 2.2.5 shows it to be about 40% slower than 1.2.0. So that is better than 2.1.0, but still substantially slower than HBase 1.2.0 was.
> >>>>>>
> >>>>>> As before, I tested this both on a 3-node cluster and with a unit test using HBaseTestingUtility. Both tests show very similar relative differences.
> >>>>>>
> >>>>>> Jan
> >>>>>>
> >>>>>> On Thu, Jun 11, 2020 at 2:16 PM Anoop John <anoop.hb...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> In another mail thread Zheng Hu brought up an important JIRA fix: https://issues.apache.org/jira/browse/HBASE-21657
> >>>>>>> Can you please check with this once?
> >>>>>>>
> >>>>>>> Anoop
> >>>>>>>
> >>>>>>> On Tue, Jun 9, 2020 at 8:08 PM Jan Van Besien <ja...@ngdata.com> wrote:
> >>>>>>>>
> >>>>>>>> On Sun, Jun 7, 2020 at 7:49 AM Anoop John <anoop.hb...@gmail.com> wrote:
> >>>>>>>>> As per the above configs, it looks like Bucket Cache is not being used. Only the on-heap LRU cache is in use.
> >>>>>>>>
> >>>>>>>> True (but it is large enough to hold everything, so I don't think it matters).
> >>>>>>>>
> >>>>>>>>> @Jan - Is it possible for you to test with an off-heap Bucket Cache? Configure the bucket cache in off-heap mode with size ~7.5 GB.
> >>>>>>>>
> >>>>>>>> I did a quick test but it seems not to make a measurable difference. If anything, it is actually slightly slower. I see a 100% hit ratio in the L1 LruBlockCache and effectively also 100% in the L2 BucketCache (the hit ratio is not yet at 100%, but hits increase with every test and misses do not).
> >>>>>>>>
> >>>>>>>> Given that the LruBlockCache was already large enough to cache all the data anyway, I did not expect this to help either, to be honest.
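For anyone repeating the off-heap bucket cache test under HBaseTestingUtility, the configuration amounts to roughly the sketch below. The class name is invented, 7680 MB is just the ~7.5 GB suggestion above, and on a real cluster the same two properties go into hbase-site.xml, with -XX:MaxDirectMemorySize raised to cover the cache.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseTestingUtility;

    /** Sketch: mini cluster with an off-heap BucketCache of roughly 7.5 GB. */
    public class OffheapCacheTestCluster {
      public static void main(String[] args) throws Exception {
        HBaseTestingUtility util = new HBaseTestingUtility();
        Configuration conf = util.getConfiguration();
        // L2 block cache backed by direct (off-heap) memory.
        conf.set("hbase.bucketcache.ioengine", "offheap");
        // Values >= 1 are read as megabytes; 7680 MB is about 7.5 GB.
        conf.setInt("hbase.bucketcache.size", 7680);
        util.startMiniCluster();
        try {
          // ... run the parallel region scan test against util.getConnection() ...
        } finally {
          util.shutdownMiniCluster();
        }
      }
    }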
> >>>>>>>>> Do you have any DataBlockEncoding enabled on the CF?
> >>>>>>>>
> >>>>>>>> Yes, FAST_DIFF. But this is of course true in both the tests, with hbase2 and with hbase1.
> >>>>>>>>
> >>>>>>>> Jan

--
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's decrepit hands
   - A23, Crosstalk