[
https://issues.apache.org/jira/browse/SOLR-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15894229#comment-15894229
]
Yonik Seeley commented on SOLR-10205:
-------------------------------------
Here are the results of some random test runs.
Each test run works by:
# Building a random index
# Running multiple query iterations. Each iteration looks like:
## create a random list of queries
## shuffle a list of test configurations (in this case, different
BlockCache configurations)
## run the set of queries with multiple query threads
## the results (like the time for the set of queries to run) is aggregated
for each configuration across multiple iterations
Since these are random queries against a random index, results vary widely.
The number of iterations also varies widely (because sometimes running a single
set of queries against a single configuration just once takes 5-10min). The
test times out after about 2 hrs.
The configurations for these first tests varied two parameters:
- when trying to find a free block, the number of times per loop we tried to
call caffeine.cleanUp() if the map size >= maxEntries
- whether to use an executor for map cleanup (the default) or specify
Runnable::run (current thread is used for any submitted tasks)
Some selected results:
{code}
55.7/47.2 = 18%
TEST=( time=532066 storeFails=384058 misses=4212598 executor=true
cleanupTries=3 reserved=1)
{size=1026,lookups=1216192323,hits=1166321317,evictions=39359922,storeFails=3800330,hitratio_current=0.96,lookups_persec=184208.98,hits_persec=176268.53,evictions_persec=6116.551,storeFails_persec=752.1843,time_delta=10.093270267,buffercache.allocations=0.0,buffercache.lost=0.0}
[junit4] 2> iter=50 RESULTS:
[junit4] 2> ( time=557072 storeFails=1473124 misses=5151913
executor=true cleanupTries=0 reserved=1)
[junit4] 2> ( time=482894 storeFails=257414 misses=4090859
executor=false cleanupTries=0 reserved=1)
[junit4] 2> ( time=536910 storeFails=460293 misses=4275501
executor=true cleanupTries=1 reserved=1)
[junit4] 2> ( time=478086 storeFails=51078 misses=3934459
executor=false cleanupTries=1 reserved=1)
[junit4] 2> ( time=529849 storeFails=418456 misses=4238267
executor=true cleanupTries=2 reserved=1)
[junit4] 2> ( time=471511 storeFails=31507 misses=3918859
executor=false cleanupTries=2 reserved=1)
[junit4] 2> ( time=532066 storeFails=384058 misses=4212598
executor=true cleanupTries=3 reserved=1)
[junit4] 2> ( time=476136 storeFails=16950 misses=3900285
executor=false cleanupTries=3 reserved=1)
[junit4] 2> ( time=532881 storeFails=358717 misses=4188116
executor=true cleanupTries=4 reserved=1)
[junit4] 2> ( time=474018 storeFails=12556 misses=3893874
executor=false cleanupTries=4 reserved=1)
[junit4] 2> ( time=528446 storeFails=324929 misses=4162481
executor=true cleanupTries=5 reserved=1)
[junit4] 2> ( time=476876 storeFails=11248 misses=3901980
executor=false cleanupTries=5 reserved=1)
32threads 65.7/55.8 = 17%
[junit4] 2> ###### Testing took 53.464 seconds. Total queries=1096 Total
matches=124437449
[junit4] 2> TEST=( time=657474 storeFails=9079465 misses=20447634
executor=true cleanupTries=0 reserved=1) {size=1023,lookups=74615842
8,hits=537166437,evictions=163933036,storeFails=21041927,hitratio_current=0.67,lookups_persec=96833.06,hits_persec=64891.32,evictions_perse
c=15433.991,storeFails_persec=14261.533,time_delta=53.463745281,buffercache.allocations=0.0,buffercache.lost=0.0}
[junit4] 2> iter=11 RESULTS:
[junit4] 2> ( time=657474 storeFails=9079465 misses=20447634
executor=true cleanupTries=0 reserved=1)
[junit4] 2> ( time=571034 storeFails=2082850 misses=17633145
executor=false cleanupTries=0 reserved=1)
[junit4] 2> ( time=617555 storeFails=2720476 misses=17615573
executor=true cleanupTries=1 reserved=1)
[junit4] 2> ( time=558536 storeFails=316721 misses=16978923
executor=false cleanupTries=1 reserved=1)
[junit4] 2> ( time=615668 storeFails=2258180 misses=17440287
executor=true cleanupTries=2 reserved=1)
[junit4] 2> ( time=560400 storeFails=75888 misses=16881946
executor=false cleanupTries=2 reserved=1)
[junit4] 2> ( time=612233 storeFails=1819890 misses=17284381
executor=true cleanupTries=3 reserved=1)
[junit4] 2> ( time=556029 storeFails=21173 misses=16840613
executor=false cleanupTries=3 reserved=1)
[junit4] 2> ( time=613037 storeFails=1476039 misses=17117779
executor=true cleanupTries=4 reserved=1)
[junit4] 2> ( time=556052 storeFails=10474 misses=16881301
executor=false cleanupTries=4 reserved=1)
[junit4] 2> ( time=611542 storeFails=1172765 misses=17019642
executor=true cleanupTries=5 reserved=1)
[junit4] 2> ( time=554806 storeFails=8005 misses=16850296
executor=false cleanupTries=5 reserved=1)
[junit4] 2> ###### Starting 32 threads to test index.
56.5/44.9 = 26%
[junit4] 2> TEST=( time=448935 storeFails=11805 misses=6577890
executor=false cleanupTries=5 reserved=1) {size=1026,lookups=517573871,
hits=434273499,evictions=66427980,storeFails=8180252,hitratio_current=0.84,lookups_persec=98465.53,hits_persec=82995.516,evictions_persec=1
3616.733,storeFails_persec=26.554682,time_delta=110.338359367,buffercache.allocations=0.0,buffercache.lost=0.0}
[junit4] 2> iter=3 RESULTS:
[junit4] 2> ( time=564653 storeFails=2763710 misses=8383855
executor=true cleanupTries=0 reserved=1)
[junit4] 2> ( time=476681 storeFails=703265 misses=7085675
executor=false cleanupTries=0 reserved=1)
[junit4] 2> ( time=538408 storeFails=1074397 misses=7120324
executor=true cleanupTries=1 reserved=1)
[junit4] 2> ( time=461370 storeFails=154650 misses=6712761
executor=false cleanupTries=1 reserved=1)
[junit4] 2> ( time=540369 storeFails=981297 misses=6973153
executor=true cleanupTries=2 reserved=1)
[junit4] 2> ( time=451366 storeFails=99016 misses=6643328
executor=false cleanupTries=2 reserved=1)
[junit4] 2> ( time=526381 storeFails=860927 misses=6900640
executor=true cleanupTries=3 reserved=1)
[junit4] 2> ( time=452509 storeFails=47727 misses=6630632
executor=false cleanupTries=3 reserved=1)
[junit4] 2> ( time=524049 storeFails=778166 misses=6851965
executor=true cleanupTries=4 reserved=1)
[junit4] 2> ( time=450483 storeFails=15949 misses=6601774
executor=false cleanupTries=4 reserved=1)
[junit4] 2> ( time=525711 storeFails=689340 misses=6809421
executor=true cleanupTries=5 reserved=1)
[junit4] 2> ( time=448935 storeFails=11805 misses=6577890
executor=false cleanupTries=5 reserved=1)
[junit4] 2> ###### Starting 16 threads to test index.
68.3/47.9 = 43%
[junit4] 2> TEST=( time=479176 storeFails=26366 misses=14968859
executor=false cleanupTries=5 reserved=1)
{size=1023,lookups=1080223338,hits=883635129,evictions=137110238
,storeFails=29268247,hitratio_current=0.83,lookups_persec=181366.17,hits_persec=149977.95,evictions_persec=26572.766,storeFails_persec=51.389427,time_delta=53.474033937,buffer
cache.allocations=0.0,buffercache.lost=0.0}
[junit4] 2> iter=7 RESULTS:
[junit4] 2> ( time=682973 storeFails=13699954 misses=25335624
executor=true cleanupTries=0 reserved=1)
[junit4] 2> ( time=504946 storeFails=2825071 misses=17380035
executor=false cleanupTries=0 reserved=1)
[junit4] 2> ( time=617386 storeFails=3449095 misses=16408244
executor=true cleanupTries=1 reserved=1)
[junit4] 2> ( time=491423 storeFails=418291 misses=15648428
executor=false cleanupTries=1 reserved=1)
[junit4] 2> ( time=608662 storeFails=2843998 misses=15869874
executor=true cleanupTries=2 reserved=1)
[junit4] 2> ( time=484558 storeFails=108380 misses=15318461
executor=false cleanupTries=2 reserved=1)
[junit4] 2> ( time=597450 storeFails=2348699 misses=15456154
executor=true cleanupTries=3 reserved=1)
[junit4] 2> ( time=485222 storeFails=44793 misses=15301369
executor=false cleanupTries=3 reserved=1)
[junit4] 2> ( time=595625 storeFails=1913073 misses=15151969
executor=true cleanupTries=4 reserved=1)
[junit4] 2> ( time=485603 storeFails=30201 misses=15239200
executor=false cleanupTries=4 reserved=1)
[junit4] 2> ( time=579955 storeFails=1560324 misses=14494648
executor=true cleanupTries=5 reserved=1)
[junit4] 2> ( time=479176 storeFails=26366 misses=14968859
executor=false cleanupTries=5 reserved=1)
[junit4] 2> ###### Starting 32 threads to test index.
71.1/67 = 6%
[junit4] 2> TEST=( time=70218 storeFails=75701 misses=685973 executor=true
cleanupTries=3 reserved=1)
{size=1023,lookups=30157128,hits=22001616,evictions=6055373,storeFails=656528,hitratio_current=0.72,lookups_persec=30802.889,hits_persec=22328.277,evictions_persec=6049.655,storeFails_persec=1036.2858,time_delta=6.087124094,buffercache.allocations=0.0,buffercache.lost=0.0}
[junit4] 2> iter=10 RESULTS:
[junit4] 2> ( time=71153 storeFails=227016 misses=755096
executor=true cleanupTries=0 reserved=1)
[junit4] 2> ( time=68022 storeFails=34926 misses=675982
executor=false cleanupTries=0 reserved=1)
[junit4] 2> ( time=69589 storeFails=88782 misses=691750
executor=true cleanupTries=1 reserved=1)
[junit4] 2> ( time=67931 storeFails=8711 misses=653859
executor=false cleanupTries=1 reserved=1)
[junit4] 2> ( time=69926 storeFails=80857 misses=689267
executor=true cleanupTries=2 reserved=1)
[junit4] 2> ( time=68225 storeFails=4527 misses=655974
executor=false cleanupTries=2 reserved=1)
[junit4] 2> ( time=70218 storeFails=75701 misses=685973
executor=true cleanupTries=3 reserved=1)
[junit4] 2> ( time=66837 storeFails=2137 misses=655777
executor=false cleanupTries=3 reserved=1)
[junit4] 2> ( time=69398 storeFails=71855 misses=693155
executor=true cleanupTries=4 reserved=1)
[junit4] 2> ( time=67337 storeFails=1822 misses=656747
executor=false cleanupTries=4 reserved=1)
[junit4] 2> ( time=69073 storeFails=58549 misses=681520
executor=true cleanupTries=5 reserved=1)
[junit4] 2> ( time=66978 storeFails=1645 misses=658712
executor=false cleanupTries=5 reserved=1)
[junit4] 2> ###### Starting 32 threads to test index.
{code}
I left out results with low iterations that resulted in high variability (need
to fix test to try and account for this), as well as tests that resulted in a
hitrate of 1 (boring).
Thoughts:
- not using an executor has the largest impact by far
- trying to call map.cleanUp() at least once when there are no free entries is
a win
- the biggest performance gains tend to be when cache hit rates are relatively
high (I guess when cache hit rates are low, the overall time is dominated by
reading from HDFS)
> Evaluate and reduce BlockCache store failures
> ---------------------------------------------
>
> Key: SOLR-10205
> URL: https://issues.apache.org/jira/browse/SOLR-10205
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Yonik Seeley
> Assignee: Yonik Seeley
> Attachments: SOLR-10205.patch
>
>
> The BlockCache is written such that requests to cache a block
> (BlockCache.store call) can fail, making caching less effective. We should
> evaluate the impact of this storage failure and potentially reduce the number
> of storage failures.
> The implementation reserves a single block of memory. In store, a block of
> memory is allocated, and then a pointer is inserted into the underling map.
> A block is only freed when the underlying map evicts the map entry.
> This means that when two store() operations are called concurrently (even
> under low load), one can fail. This is made worse by the fact that
> concurrent maps typically tend to amortize the cost of eviction over many
> keys (i.e. the actual size of the map can grow beyond the configured maximum
> number of entries... both the older ConcurrentLinkedHashMap and newer
> Caffeine do this). When this is the case, store() won't be able to find a
> free block of memory, even if there aren't any other concurrently operating
> stores.
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]