[ 
https://issues.apache.org/jira/browse/SOLR-10205?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15894229#comment-15894229
 ] 

Yonik Seeley commented on SOLR-10205:
-------------------------------------

Here are the results of some random test runs.

Each test run works by:
 # Building a random index
 # Running multiple query iterations.  Each iteration looks like:
    ## create a random list of queries
    ## shuffle a list of test configurations (in this case, different 
BlockCache configurations)
    ## run the set of queries with multiple query threads
    ## the results (like the time for the set of queries to run) is aggregated 
for each configuration across multiple iterations

Since these are random queries against a random index, results vary widely.
The number of iterations also varies widely (because sometimes running a single 
set of queries against a single configuration just once takes 5-10min).  The 
test times out after about 2 hrs.

The configurations for these first tests varied two parameters:
- when trying to find a free block, the number of times per loop we tried to 
call caffeine.cleanUp() if the map size >= maxEntries
- whether to use an executor for map cleanup (the default) or specify 
Runnable::run (current thread is used for any submitted tasks)

Some selected results:
{code}
55.7/47.2 = 18%
TEST=( time=532066 storeFails=384058 misses=4212598 executor=true 
cleanupTries=3 reserved=1) 
{size=1026,lookups=1216192323,hits=1166321317,evictions=39359922,storeFails=3800330,hitratio_current=0.96,lookups_persec=184208.98,hits_persec=176268.53,evictions_persec=6116.551,storeFails_persec=752.1843,time_delta=10.093270267,buffercache.allocations=0.0,buffercache.lost=0.0}
   [junit4]   2> iter=50 RESULTS:
   [junit4]   2>        ( time=557072 storeFails=1473124 misses=5151913 
executor=true cleanupTries=0 reserved=1)
   [junit4]   2>        ( time=482894 storeFails=257414 misses=4090859 
executor=false cleanupTries=0 reserved=1)
   [junit4]   2>        ( time=536910 storeFails=460293 misses=4275501 
executor=true cleanupTries=1 reserved=1)
   [junit4]   2>        ( time=478086 storeFails=51078 misses=3934459 
executor=false cleanupTries=1 reserved=1)
   [junit4]   2>        ( time=529849 storeFails=418456 misses=4238267 
executor=true cleanupTries=2 reserved=1)
   [junit4]   2>        ( time=471511 storeFails=31507 misses=3918859 
executor=false cleanupTries=2 reserved=1)
   [junit4]   2>        ( time=532066 storeFails=384058 misses=4212598 
executor=true cleanupTries=3 reserved=1)
   [junit4]   2>        ( time=476136 storeFails=16950 misses=3900285 
executor=false cleanupTries=3 reserved=1)
   [junit4]   2>        ( time=532881 storeFails=358717 misses=4188116 
executor=true cleanupTries=4 reserved=1)
   [junit4]   2>        ( time=474018 storeFails=12556 misses=3893874 
executor=false cleanupTries=4 reserved=1)
   [junit4]   2>        ( time=528446 storeFails=324929 misses=4162481 
executor=true cleanupTries=5 reserved=1)
   [junit4]   2>        ( time=476876 storeFails=11248 misses=3901980 
executor=false cleanupTries=5 reserved=1)

32threads 65.7/55.8 = 17%
   [junit4]   2> ###### Testing took 53.464 seconds. Total queries=1096 Total 
matches=124437449
   [junit4]   2> TEST=( time=657474 storeFails=9079465 misses=20447634 
executor=true cleanupTries=0 reserved=1) {size=1023,lookups=74615842
8,hits=537166437,evictions=163933036,storeFails=21041927,hitratio_current=0.67,lookups_persec=96833.06,hits_persec=64891.32,evictions_perse
c=15433.991,storeFails_persec=14261.533,time_delta=53.463745281,buffercache.allocations=0.0,buffercache.lost=0.0}
   [junit4]   2> iter=11 RESULTS:
   [junit4]   2>        ( time=657474 storeFails=9079465 misses=20447634 
executor=true cleanupTries=0 reserved=1)
   [junit4]   2>        ( time=571034 storeFails=2082850 misses=17633145 
executor=false cleanupTries=0 reserved=1)
   [junit4]   2>        ( time=617555 storeFails=2720476 misses=17615573 
executor=true cleanupTries=1 reserved=1)
   [junit4]   2>        ( time=558536 storeFails=316721 misses=16978923 
executor=false cleanupTries=1 reserved=1)
   [junit4]   2>        ( time=615668 storeFails=2258180 misses=17440287 
executor=true cleanupTries=2 reserved=1)
   [junit4]   2>        ( time=560400 storeFails=75888 misses=16881946 
executor=false cleanupTries=2 reserved=1)
   [junit4]   2>        ( time=612233 storeFails=1819890 misses=17284381 
executor=true cleanupTries=3 reserved=1)
   [junit4]   2>        ( time=556029 storeFails=21173 misses=16840613 
executor=false cleanupTries=3 reserved=1)
   [junit4]   2>        ( time=613037 storeFails=1476039 misses=17117779 
executor=true cleanupTries=4 reserved=1)
   [junit4]   2>        ( time=556052 storeFails=10474 misses=16881301 
executor=false cleanupTries=4 reserved=1)
   [junit4]   2>        ( time=611542 storeFails=1172765 misses=17019642 
executor=true cleanupTries=5 reserved=1)
   [junit4]   2>        ( time=554806 storeFails=8005 misses=16850296 
executor=false cleanupTries=5 reserved=1)
   [junit4]   2> ###### Starting 32 threads to test index.

56.5/44.9 = 26%
   [junit4]   2> TEST=( time=448935 storeFails=11805 misses=6577890 
executor=false cleanupTries=5 reserved=1) {size=1026,lookups=517573871,
hits=434273499,evictions=66427980,storeFails=8180252,hitratio_current=0.84,lookups_persec=98465.53,hits_persec=82995.516,evictions_persec=1
3616.733,storeFails_persec=26.554682,time_delta=110.338359367,buffercache.allocations=0.0,buffercache.lost=0.0}
   [junit4]   2> iter=3 RESULTS:
   [junit4]   2>        ( time=564653 storeFails=2763710 misses=8383855 
executor=true cleanupTries=0 reserved=1)
   [junit4]   2>        ( time=476681 storeFails=703265 misses=7085675 
executor=false cleanupTries=0 reserved=1)
   [junit4]   2>        ( time=538408 storeFails=1074397 misses=7120324 
executor=true cleanupTries=1 reserved=1)
   [junit4]   2>        ( time=461370 storeFails=154650 misses=6712761 
executor=false cleanupTries=1 reserved=1)
   [junit4]   2>        ( time=540369 storeFails=981297 misses=6973153 
executor=true cleanupTries=2 reserved=1)
   [junit4]   2>        ( time=451366 storeFails=99016 misses=6643328 
executor=false cleanupTries=2 reserved=1)
   [junit4]   2>        ( time=526381 storeFails=860927 misses=6900640 
executor=true cleanupTries=3 reserved=1)
   [junit4]   2>        ( time=452509 storeFails=47727 misses=6630632 
executor=false cleanupTries=3 reserved=1)
   [junit4]   2>        ( time=524049 storeFails=778166 misses=6851965 
executor=true cleanupTries=4 reserved=1)
   [junit4]   2>        ( time=450483 storeFails=15949 misses=6601774 
executor=false cleanupTries=4 reserved=1)
   [junit4]   2>        ( time=525711 storeFails=689340 misses=6809421 
executor=true cleanupTries=5 reserved=1)
   [junit4]   2>        ( time=448935 storeFails=11805 misses=6577890 
executor=false cleanupTries=5 reserved=1)
   [junit4]   2> ###### Starting 16 threads to test index.

68.3/47.9 = 43%
   [junit4]   2> TEST=( time=479176 storeFails=26366 misses=14968859 
executor=false cleanupTries=5 reserved=1) 
{size=1023,lookups=1080223338,hits=883635129,evictions=137110238
,storeFails=29268247,hitratio_current=0.83,lookups_persec=181366.17,hits_persec=149977.95,evictions_persec=26572.766,storeFails_persec=51.389427,time_delta=53.474033937,buffer
cache.allocations=0.0,buffercache.lost=0.0}
   [junit4]   2> iter=7 RESULTS:
   [junit4]   2>        ( time=682973 storeFails=13699954 misses=25335624 
executor=true cleanupTries=0 reserved=1)
   [junit4]   2>        ( time=504946 storeFails=2825071 misses=17380035 
executor=false cleanupTries=0 reserved=1)
   [junit4]   2>        ( time=617386 storeFails=3449095 misses=16408244 
executor=true cleanupTries=1 reserved=1)
   [junit4]   2>        ( time=491423 storeFails=418291 misses=15648428 
executor=false cleanupTries=1 reserved=1)
   [junit4]   2>        ( time=608662 storeFails=2843998 misses=15869874 
executor=true cleanupTries=2 reserved=1)
   [junit4]   2>        ( time=484558 storeFails=108380 misses=15318461 
executor=false cleanupTries=2 reserved=1)
   [junit4]   2>        ( time=597450 storeFails=2348699 misses=15456154 
executor=true cleanupTries=3 reserved=1)
   [junit4]   2>        ( time=485222 storeFails=44793 misses=15301369 
executor=false cleanupTries=3 reserved=1)
   [junit4]   2>        ( time=595625 storeFails=1913073 misses=15151969 
executor=true cleanupTries=4 reserved=1)
   [junit4]   2>        ( time=485603 storeFails=30201 misses=15239200 
executor=false cleanupTries=4 reserved=1)
   [junit4]   2>        ( time=579955 storeFails=1560324 misses=14494648 
executor=true cleanupTries=5 reserved=1)
   [junit4]   2>        ( time=479176 storeFails=26366 misses=14968859 
executor=false cleanupTries=5 reserved=1)
   [junit4]   2> ###### Starting 32 threads to test index.

71.1/67 = 6%
   [junit4]   2> TEST=( time=70218 storeFails=75701 misses=685973 executor=true 
cleanupTries=3 reserved=1) 
{size=1023,lookups=30157128,hits=22001616,evictions=6055373,storeFails=656528,hitratio_current=0.72,lookups_persec=30802.889,hits_persec=22328.277,evictions_persec=6049.655,storeFails_persec=1036.2858,time_delta=6.087124094,buffercache.allocations=0.0,buffercache.lost=0.0}
   [junit4]   2> iter=10 RESULTS:
   [junit4]   2>        ( time=71153 storeFails=227016 misses=755096 
executor=true cleanupTries=0 reserved=1)
   [junit4]   2>        ( time=68022 storeFails=34926 misses=675982 
executor=false cleanupTries=0 reserved=1)
   [junit4]   2>        ( time=69589 storeFails=88782 misses=691750 
executor=true cleanupTries=1 reserved=1)
   [junit4]   2>        ( time=67931 storeFails=8711 misses=653859 
executor=false cleanupTries=1 reserved=1)
   [junit4]   2>        ( time=69926 storeFails=80857 misses=689267 
executor=true cleanupTries=2 reserved=1)
   [junit4]   2>        ( time=68225 storeFails=4527 misses=655974 
executor=false cleanupTries=2 reserved=1)
   [junit4]   2>        ( time=70218 storeFails=75701 misses=685973 
executor=true cleanupTries=3 reserved=1)
   [junit4]   2>        ( time=66837 storeFails=2137 misses=655777 
executor=false cleanupTries=3 reserved=1)
   [junit4]   2>        ( time=69398 storeFails=71855 misses=693155 
executor=true cleanupTries=4 reserved=1)
   [junit4]   2>        ( time=67337 storeFails=1822 misses=656747 
executor=false cleanupTries=4 reserved=1)
   [junit4]   2>        ( time=69073 storeFails=58549 misses=681520 
executor=true cleanupTries=5 reserved=1)
   [junit4]   2>        ( time=66978 storeFails=1645 misses=658712 
executor=false cleanupTries=5 reserved=1)
   [junit4]   2> ###### Starting 32 threads to test index.
{code}

I left out results with low iterations that resulted in high variability (need 
to fix test to try and account for this), as well as tests that resulted in a 
hitrate of 1 (boring).
Thoughts:
- not using an executor has the largest impact by far
- trying to call map.cleanUp() at least once when there are no free entries is 
a win
- the biggest performance gains tend to be when cache hit rates are relatively 
high (I guess when cache hit rates are low, the overall time is dominated by 
reading from HDFS)


> Evaluate and reduce BlockCache store failures
> ---------------------------------------------
>
>                 Key: SOLR-10205
>                 URL: https://issues.apache.org/jira/browse/SOLR-10205
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Yonik Seeley
>            Assignee: Yonik Seeley
>         Attachments: SOLR-10205.patch
>
>
> The BlockCache is written such that requests to cache a block 
> (BlockCache.store call) can fail, making caching less effective.  We should 
> evaluate the impact of this storage failure and potentially reduce the number 
> of storage failures.
> The implementation reserves a single block of memory.  In store, a block of 
> memory is allocated, and then a pointer is inserted into the underling map.  
> A block is only freed when the underlying map evicts the map entry.
> This means that when two store() operations are called concurrently (even 
> under low load), one can fail.  This is made worse by the fact that 
> concurrent maps typically tend to amortize the cost of eviction over many 
> keys (i.e. the actual size of the map can grow beyond the configured maximum 
> number of entries... both the older ConcurrentLinkedHashMap and newer 
> Caffeine do this).  When this is the case, store() won't be able to find a 
> free block of memory, even if there aren't any other concurrently operating 
> stores.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

Reply via email to