[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876872#comment-15876872 ]

ASF subversion and git services commented on SOLR-10141:

Commit d8799bc475ca5d384ec49ecf2726aec58e37447b in lucene-solr's branch refs/heads/branch_6x from [~yo...@apache.org]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d8799bc ]

SOLR-10141: Upgrade to Caffeine 2.4.0 to fix issues with removal listener

> Caffeine cache causes BlockCache corruption
>
> Key: SOLR-10141
> URL: https://issues.apache.org/jira/browse/SOLR-10141
> Project: Solr
> Issue Type: Bug
> Security Level: Public (Default Security Level. Issues are Public)
> Reporter: Yonik Seeley
> Attachments: SOLR-10141.patch, Solr10141Test.java
>
> After fixing the race conditions in the BlockCache itself (SOLR-10121), the
> concurrency test passes with the previous implementation using
> ConcurrentLinkedHashMap and fails with Caffeine.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876847#comment-15876847 ]

ASF subversion and git services commented on SOLR-10141:

Commit e9e02a2313518682690ca2933efd0b4db0b54b7c in lucene-solr's branch refs/heads/master from [~yo...@apache.org]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=e9e02a2 ]

SOLR-10141: Upgrade to Caffeine 2.4.0 to fix issues with removal listener
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873511#comment-15873511 ]

Ben Manes commented on SOLR-10141:

Released 2.4.0
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873361#comment-15873361 ]

Ben Manes commented on SOLR-10141:

That makes sense. If it's a fallback when an empty slot can't be acquired, it may be preferable to always calling cleanUp(). But a stress test would be necessary to verify that, as the spin time might be too short for it to help.

In most traces frequency dominates over recency, so most insertions are pollutants. The impact of a failed insertion might not have had a negative result, as a popular item would make its way in. Then the failing one-hit wonders wouldn't have disrupted the LRU as much. That's less meaningful with Caffeine, since we switched to TinyLFU.

As an aside, I'd appreciate help in moving SOLR-8241 forward. It's been approved but backlogged, as the committer has not had the time to actively participate in Solr. But if that's crossing territories or you feel uncomfortable due to this bug, I understand.
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873357#comment-15873357 ]

Yonik Seeley commented on SOLR-10141:

The size issue is only an issue for the BlockCache specifically (not for any other Solr caches).

Actually, the way the BlockCache is written, we are guaranteed to never have more than maxEntries... writers have to wait for an open slot (which opens up once the removal listener is called). The writer spins a bit trying to find an open slot and fails if it can't. Doing extra work via cache.cleanUp() if we don't see an empty slot is definitely better than failing to cache the entry.

I imagine the issue existed when CLHM was used as well. The metric of store failures isn't currently tracked, and it only leads to a lower cache hit rate. I plan on starting to track it, and then to see how often it happens when we're actually caching real HDFS blocks. That's a separate issue though.
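The fallback discussed in this comment (spin briefly for a free slot, then do the cleanup work on the calling thread before giving up) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the BlockCache code: `SlotFallbackSketch`, `tryAcquire`, `release`, and `acquire` are hypothetical names, and the `maintenance` callback merely stands in for a call such as `cache.cleanUp()`.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: a fixed pool of slots with a "spin, then clean up, then retry" acquire.
class SlotFallbackSketch {
    private final AtomicBoolean[] taken;

    SlotFallbackSketch(int slots) {
        taken = new AtomicBoolean[slots];
        for (int i = 0; i < slots; i++) {
            taken[i] = new AtomicBoolean(false);
        }
    }

    /** Returns the index of a newly claimed slot, or -1 if all slots are taken. */
    int tryAcquire() {
        for (int i = 0; i < taken.length; i++) {
            if (taken[i].compareAndSet(false, true)) {
                return i;
            }
        }
        return -1;
    }

    /** Frees a slot; in the real cache this would happen in the removal listener. */
    void release(int slot) {
        taken[slot].set(false);
    }

    /** Spins a few times, then runs the maintenance callback once and tries a final time. */
    int acquire(int spins, Runnable maintenance) {
        for (int attempt = 0; attempt < spins; attempt++) {
            int slot = tryAcquire();
            if (slot >= 0) {
                return slot;
            }
            Thread.yield();
        }
        maintenance.run(); // assumption: stands in for cache.cleanUp()
        return tryAcquire();
    }

    public static void main(String[] args) {
        SlotFallbackSketch slots = new SlotFallbackSketch(1);
        int first = slots.acquire(3, () -> {});                      // free slot available
        int failed = slots.acquire(3, () -> {});                     // no slot, cleanup frees nothing
        int rescued = slots.acquire(3, () -> slots.release(first));  // cleanup frees slot 0
        System.out.println(first + " " + failed + " " + rescued);    // prints "0 -1 0"
    }
}
```

Whether the extra maintenance pass is worth its cost under real load is exactly what the stress test mentioned in the thread would have to show.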
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873334#comment-15873334 ]

Ben Manes commented on SOLR-10141:

If you wish to ensure a very strict bounding by throttling writers, that would do the job. I'm not sure if it's needed except in your tests, as in practice the assumption is that it's cleaned up in a timely enough manner.

The cache uses a bounded write buffer to provide some slack, minimize the response latencies for writers, and defer the cleanup to the executor (scheduled as immediate). This allows the cache to temporarily exceed the high water mark, but catch up quickly. In general a high write rate on a cache is actually 2-3 inserts/sec, there's memory headroom for GC, and the server isn't CPU bound. If instead we ensured a strict bound, then we'd need a global lock to throttle writers on, which limits concurrency. So it's a trade-off that works for most usages.

CLHM uses the same design, so I wonder if only your tests are affected but it is okay in practice. CLHM uses an unbounded write buffer, whereas in Caffeine it's bounded to provide some back pressure if full. Being full is very rare, so this is mostly to replace linked lists with a growable ring buffer. The slack is probably excessive as I didn't have a good sizing parameter (max ~= 128 x ncpu).

The cleanUp() call forces the caller to block and do the maintenance itself, rather than relying on the async processing (which may be in-flight or triggered on a subsequent operation). You can get a sense of this write-ahead log design from this [slide deck|https://docs.google.com/presentation/d/1NlDxyXsUG1qlVHMl4vsUUBQfAJ2c2NsFPNPr2qymIBs].

I'm not sure what, or if, I can do anything regarding your size concern. But I'll wait for releasing 2.4 until you're satisfied that we've resolved all the issues.
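To make the trade-off in this comment more concrete, here is a minimal sketch of a bounded write buffer with back pressure: writers normally just enqueue an event, and only when the buffer is full does the writer block and perform the maintenance itself. It is not Caffeine's implementation (which, per the comment, uses a growable ring buffer and an executor); `WriteBufferSketch`, `afterWrite`, and `drain` are hypothetical names, and `ArrayBlockingQueue` is used only for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of a bounded write buffer that pushes back on writers when full.
class WriteBufferSketch {
    private final ArrayBlockingQueue<Integer> buffer;
    final List<Integer> applied = Collections.synchronizedList(new ArrayList<>());
    final AtomicInteger callerDrains = new AtomicInteger();

    WriteBufferSketch(int capacity) {
        buffer = new ArrayBlockingQueue<>(capacity);
    }

    /** Records a write event; if the buffer is full, the writer drains it itself. */
    void afterWrite(int event) {
        while (!buffer.offer(event)) {
            callerDrains.incrementAndGet();
            drain(); // back pressure: the caller pays for the maintenance
        }
    }

    /** Applies all pending events on the calling thread (similar in spirit to cleanUp()). */
    synchronized void drain() {
        Integer event;
        while ((event = buffer.poll()) != null) {
            applied.add(event); // a real cache would update its eviction policy here
        }
    }

    public static void main(String[] args) {
        WriteBufferSketch sketch = new WriteBufferSketch(2);
        sketch.afterWrite(1);
        sketch.afterWrite(2);
        sketch.afterWrite(3); // buffer full: this writer drains events 1 and 2 first
        System.out.println(sketch.applied + " drains=" + sketch.callerDrains.get());
        // prints "[1, 2] drains=1"
    }
}
```

Between drains the pending events are the "slack" described above: the structure behind the buffer can lag the writers by up to the buffer's capacity.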
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873210#comment-15873210 ]

Yonik Seeley commented on SOLR-10141:

Thanks Ben, I confirmed that this fixes the removalListener issue.

As far as the cache size issue, I've found that calling cache.cleanUp() after a put() seems to keep things under control. Is there any other method I should look at?

{code}
if (cache.estimatedSize() > maxEntries) {
  // BlockCache *really* relies on having enough removalListeners called to get back down to the configured maxEntries (otherwise the
  // underlying direct memory will be exhausted and the BlockCache.store will have to fail).
  cache.cleanUp();
}
{code}
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15873011#comment-15873011 ]

Ben Manes commented on SOLR-10141:

[Pull Request|https://github.com/ben-manes/caffeine/pull/144] with the fix and your test case.
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872983#comment-15872983 ]

Ben Manes commented on SOLR-10141:

Thanks!!!

I think I found the bug. It now passes your test case.

The problem was due to put() stampeding over the value during the eviction. The [eviction routine|https://github.com/ben-manes/caffeine/blob/65e3efd4b50613c27567ff594877d0f63acfbce2/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L725] performed the following:

# Read the key, value, etc
# Conditionally removed in a computeIfPresent() block - resurrected if a race occurred (e.g. was thought expired, but newly accessed)
# Mark the entry as "dead" (using a synchronized (entry) block)
# Notify the listener

This failed because [putFast|https://github.com/ben-manes/caffeine/blob/65e3efd4b50613c27567ff594877d0f63acfbce2/caffeine/src/main/java/com/github/benmanes/caffeine/cache/BoundedLocalCache.java#L1521] can perform its update outside of a hash table lock (e.g. a computation). It synchronizes on the entry to update, checking first if it was still alive. This resulted in a race where the entry was removed from the hash table, the value updated, and entry marked as dead. When the listener was notified, it received the wrong value.

The solution I have now is to expand the synchronized block on eviction. This passes your test and should be cheap. I'd like to review it a little more and incorporate your test into my suite.

This is an excellent find. I've stared at the code many times and the race seems obvious in hindsight.
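The fix described in this comment, widening the synchronized block on eviction so that reading the value, removing the entry, and marking it dead happen as one step relative to an in-place update, can be illustrated with the simplified sketch below. It is not the code from BoundedLocalCache: `EvictionSketch`, `Entry`, `updateInPlace`, and `evict` are hypothetical names, and details such as the REPLACED notification and the computeIfPresent() step are left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: eviction and in-place update both synchronize on the entry.
class EvictionSketch {
    static final class Entry {
        String value;          // guarded by synchronized (this)
        boolean alive = true;  // guarded by synchronized (this)
        Entry(String value) { this.value = value; }
    }

    final ConcurrentHashMap<Integer, Entry> map = new ConcurrentHashMap<>();
    final List<String> notified = Collections.synchronizedList(new ArrayList<>());

    void insert(int key, String value) {
        map.put(key, new Entry(value));
    }

    /** Updates outside any hash table lock; returns false if the entry is gone or dead. */
    boolean updateInPlace(int key, String newValue) {
        Entry e = map.get(key);
        if (e == null) {
            return false;
        }
        synchronized (e) {
            if (!e.alive) {
                return false; // lost the race with eviction; the caller must insert afresh
            }
            e.value = newValue;
            return true;
        }
    }

    /** Evicts the entry; the value is read, removed, and marked dead under one lock. */
    void evict(int key) {
        Entry e = map.get(key);
        if (e == null) {
            return;
        }
        String removed;
        synchronized (e) {
            if (!e.alive || !map.remove(key, e)) {
                return;
            }
            removed = e.value; // cannot change any more: updaters will see alive == false
            e.alive = false;
        }
        notified.add(removed); // stands in for the removal listener call
    }

    public static void main(String[] args) {
        EvictionSketch cache = new EvictionSketch();
        cache.insert(1, "a");
        cache.updateInPlace(1, "b");
        cache.evict(1);
        System.out.println(cache.notified + " " + cache.updateInPlace(1, "c"));
        // prints "[b] false"
    }
}
```

In the buggy ordering from the numbered list, the value was read before the entry was marked dead, so an update could slip in between and the listener would be handed a stale value.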
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872969#comment-15872969 ]

Ben Manes commented on SOLR-10141:

Thanks! I'm resolving some issues with the latest error-prone (static analyzer) and will then dig into it.
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872965#comment-15872965 ]

Yonik Seeley commented on SOLR-10141:

I checked in the test (test method testCacheConcurrent):
https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;a=blob;f=solr/core/src/test/org/apache/solr/store/blockcache/BlockCacheTest.java
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872943#comment-15872943 ]

Ben Manes commented on SOLR-10141:

Can you provide me with the latest version of a self-contained test? If I can reproduce and debug it, I'll have a fix over the weekend.

v2 introduced a new eviction policy to take the frequency into account. The eviction should be rapid, so these remaining issues are surprising. I've tried to be diligent about testing, so I will investigate.
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872937#comment-15872937 ]

Yonik Seeley commented on SOLR-10141:

Well darn... it looked like things were fixed by the upgrade to 2.3.5, but then I looked a little closer.

I happened to notice that the hit rate was super high, when I designed the test to be closer to 50% (maxEntries = maxBlocks/2).

When I set these parameters in the test:
{code}
final int readLastBlockOdds=0; // odds (1 in N) of the next block operation being on the same block as the previous operation... helps flush concurrency issues
final boolean updateAnyway = false; // sometimes insert a new entry for the key even if one was found
{code}

It results in something like this:
{code}
Done! # of Elements = 200 inserts=17234 removals=17034 hits=9982766 maxObservedSize=401
{code}

So for 10M multi-threaded reads, our hit rate was 99.8%, which artificially lowers the rate at which we insert new entries, and hence doesn't exercise the concurrency as well, leading to a passing test most of the time.

When I modified the test to increase the write concurrency again, accounting for a cache that is apparently too big:
{code}
final int readLastBlockOdds=10; // odds (1 in N) of the next block operation being on the same block as the previous operation... helps flush concurrency issues
final boolean updateAnyway = true; // sometimes insert a new entry for the key even if one was found
{code}

The removal listener issues reappear:
{code}
WARNING: Exception thrown by removal listener
java.lang.RuntimeException: listener called more than once! k=103 v=org.apache.solr.store.blockcache.BlockCacheTest$Val@49dbc210 removalCause=SIZE
	at org.apache.solr.store.blockcache.BlockCacheTest.lambda$testCacheConcurrent$0(BlockCacheTest.java:250)
	at org.apache.solr.store.blockcache.BlockCacheTest$$Lambda$5/498475569.onRemoval(Unknown Source)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.lambda$notifyRemoval$1(BoundedLocalCache.java:286)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache$$Lambda$12/1297599052.run(Unknown Source)
	at org.apache.solr.store.blockcache.BlockCacheTest$$Lambda$7/957914685.execute(Unknown Source)
{code}

Guarding against the removal listener being called more than once with the same entry also doesn't seem to work (same as before), since it then becomes apparent that some entries never get passed to the removal listener.

Even if the removal listener issues are fixed, the fact that the cache can be bigger than the configured size is a problem for us. The map itself is not storing the data, only controlling access to direct memory, so timely removal (and a timely call to the removal listener) under heavy concurrency is critical. Without that, the cache will cease to function as an LRU cache under load because we won't be able to find a free block in the direct memory to actually use.

Even with only 2 threads, I see the cache going to at least double the configured maxEntries. Is there a way to configure the size checking to be more strict?
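The figures quoted in this comment can be cross-checked with a few lines of arithmetic. The sketch below assumes that every read which was not a hit led to exactly one insert (plausible with updateAnyway = false, but an assumption nonetheless); `HitRateCheck` and `hitRate` are names made up for the illustration.

```java
// Hypothetical helper for sanity-checking the counters reported by the test run.
class HitRateCheck {
    /** Hit rate under the assumption that each miss produced exactly one insert. */
    static double hitRate(long hits, long inserts) {
        return (double) hits / (hits + inserts);
    }

    public static void main(String[] args) {
        long inserts = 17234;
        long removals = 17034;
        long hits = 9982766;

        // Entries still in the cache at the end: matches "# of Elements = 200".
        System.out.println("live entries = " + (inserts - removals));

        // hits + inserts is exactly 10,000,000 reads, giving a hit rate of about 0.998.
        System.out.println("reads = " + (hits + inserts));
        System.out.println("hit rate = " + hitRate(hits, inserts));
    }
}
```

The reported maxObservedSize=401 is roughly twice the 200 live entries, which is consistent with the observation at the end of the comment that the cache grows to at least double the configured maxEntries.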
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872906#comment-15872906 ]

ASF subversion and git services commented on SOLR-10141:

Commit d810edf5e900bef32b10928d275a02c093d359b6 in lucene-solr's branch refs/heads/branch_6x from [~yo...@apache.org]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=d810edf ]

SOLR-10141: add test for underlying cache
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872905#comment-15872905 ]

ASF subversion and git services commented on SOLR-10141:

Commit 33e398c02115c57ea54bda5f6f612f1b06c1e771 in lucene-solr's branch refs/heads/master from [~yo...@apache.org]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=33e398c ]

SOLR-10141: add test for underlying cache
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872344#comment-15872344 ]

ASF subversion and git services commented on SOLR-10141:

Commit be61c6634872435614ea4d59fd14df3426398116 in lucene-solr's branch refs/heads/branch_6x from [~yo...@apache.org]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=be61c66 ]

SOLR-10141: Upgrade to Caffeine 2.3.5 to fix issues with removal listener
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872221#comment-15872221 ]

Ben Manes commented on SOLR-10141:

Thanks [~ysee...@gmail.com]. Sorry about any frustrations this caused.
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15872211#comment-15872211 ]

ASF subversion and git services commented on SOLR-10141:

Commit 6804f3694210ac34728dd6f1a74736681dae2837 in lucene-solr's branch refs/heads/master from [~yo...@apache.org]
[ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=6804f36 ]

SOLR-10141: Upgrade to Caffeine 2.3.5 to fix issues with removal listener
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868205#comment-15868205 ]

Ben Manes commented on SOLR-10141:

I ran your test against master and it doesn't fail. Can you please try Caffeine 2.3.5? The only change needed is that the RemovalListener is now lambda friendly.
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868168#comment-15868168 ]

Ben Manes commented on SOLR-10141:

Oh, also older JDK 8 versions had a bug in FJP (ForkJoinPool) causing it to drop tasks. That's also a possibility at play.
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868161#comment-15868161 ]

Ben Manes commented on SOLR-10141:

I plan on porting the test to Caffeine's suite and checking against 2.x. Just waiting for my train to start.
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868132#comment-15868132 ]

Yonik Seeley commented on SOLR-10141:

Adding a guard in the test code is easy enough (just check if "live" has already been set to false), but that then causes an additional problem: a memory leak since size() != (adds-removes) at the end (i.e. the removal listener is not called for all items).

It looks like the removal listener is called the correct number of times, but not always with the correct value. My guess is that it's somehow related to concurrent use of equal keys with different values.
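The effect described in this comment can be replayed deterministically: if the listener is invoked the right number of times but twice with the same value, a guard on a "live" flag swallows the duplicate and one value is never accounted for. The sketch below assumes a guard built on an AtomicBoolean; `GuardedListenerDemo`, `Val`, `onRemoval`, and `leakedValues` are hypothetical names and not the code in BlockCacheTest.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a guarded removal listener hides duplicates but exposes a leak.
class GuardedListenerDemo {
    static final class Val {
        final AtomicBoolean live = new AtomicBoolean(true);
    }

    private final AtomicInteger removals = new AtomicInteger();

    /** Counts a value only the first time it is reported as removed. */
    void onRemoval(Val v) {
        if (v.live.compareAndSet(true, false)) {
            removals.incrementAndGet();
        }
    }

    /** Replays two evictions where the listener receives the first value twice. */
    static int leakedValues() {
        GuardedListenerDemo demo = new GuardedListenerDemo();
        Val first = new Val();
        Val second = new Val();  // evicted, but never handed to the listener
        int evicted = 2;
        demo.onRemoval(first);
        demo.onRemoval(first);   // wrong value: should have been 'second'
        if (!second.live.get()) {
            throw new IllegalStateException("second was unexpectedly released");
        }
        return evicted - demo.removals.get();
    }

    public static void main(String[] args) {
        System.out.println("values never released: " + leakedValues()); // prints 1
    }
}
```

In the BlockCache setting, each such leaked value corresponds to a block of direct memory that is never returned to the pool.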
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868110#comment-15868110 ]

Ben Manes commented on SOLR-10141:

It may be FJP retrying a task if it is slow to complete. If so, we might need to put a guard to ignore multiple attempts. I can help when you have a test case to investigate with.
[jira] [Commented] (SOLR-10141) Caffeine cache causes BlockCache corruption
[ https://issues.apache.org/jira/browse/SOLR-10141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868099#comment-15868099 ]

Yonik Seeley commented on SOLR-10141:

OK, so I finally tracked down the corruption failures with Caffeine to the removal listener being called more than once with the same value.

The first time, the underlying block is released and then presumably reused for a different key. The next time (which should never happen), the underlying block is unlocked again and can hence be reused by an additional key and we get into a situation where multiple "live" keys point to the same underlying memory block (and corruption results).
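The failure sequence described in this comment can be replayed single-threaded with a toy slot allocator. This is a sketch under assumptions, not the BlockCache bookkeeping: `DoubleReleaseDemo`, `Slots`, `acquire`, and `release` are hypothetical names, and a BitSet stands in for whatever structure tracks which memory blocks are in use.

```java
import java.util.BitSet;

// Hypothetical sketch: a duplicate release lets two live keys share one memory block.
class DoubleReleaseDemo {
    /** One bit per memory block; a set bit means the block is in use. */
    static final class Slots {
        private final BitSet used;
        private final int capacity;

        Slots(int capacity) {
            this.capacity = capacity;
            this.used = new BitSet(capacity);
        }

        /** Returns a free block index, or -1 if every block is taken. */
        int acquire() {
            int i = used.nextClearBit(0);
            if (i >= capacity) {
                return -1;
            }
            used.set(i);
            return i;
        }

        /** Marks the block as free without checking that the caller still owns it. */
        void release(int slot) {
            used.clear(slot);
        }
    }

    /** Replays the sequence and reports whether two live keys ended up on the same block. */
    static boolean twoLiveKeysShareBlock() {
        Slots slots = new Slots(1);
        int blockOfA = slots.acquire();  // key A owns block 0
        slots.release(blockOfA);         // listener call #1 for A: legitimate
        int blockOfB = slots.acquire();  // key B reuses block 0
        slots.release(blockOfA);         // listener call #2 for A: frees B's block
        int blockOfC = slots.acquire();  // key C is handed block 0 as well
        return blockOfB == blockOfC;     // B is still live, so B and C overlap
    }

    public static void main(String[] args) {
        System.out.println("shared block: " + twoLiveKeysShareBlock()); // prints "shared block: true"
    }
}
```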