[ https://issues.apache.org/jira/browse/KAFKA-15481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770830#comment-17770830 ]
Arpit Goyal commented on KAFKA-15481: ------------------------------------- [~divijvaidya] [~showuon] 1\ we want to ensure here is that removing the entry from cache and renaming the file to add "delete" suffix should be atomic operations OR 2\ we can add a UUID at the end of each index file downloaded from remote for this cache so that each "entry" in the cache has a unique file associated with it. Should we go with the first or the 2nd approach , IMO 2nd would decouple the flow and would be simple to understand ? > Concurrency bug in RemoteIndexCache leads to IOException > -------------------------------------------------------- > > Key: KAFKA-15481 > URL: https://issues.apache.org/jira/browse/KAFKA-15481 > Project: Kafka > Issue Type: Bug > Affects Versions: 3.6.0 > Reporter: Divij Vaidya > Priority: Major > Fix For: 3.7.0 > > > RemoteIndexCache has a concurrency bug which leads to IOException while > fetching data from remote tier. > Below events in order of timeline - > Thread 1 (cache thread): invalidates the entry, removalListener is invoked > async, so the files have not been renamed to "deleted" suffix yet. > Thread 2: (fetch thread): tries to find entry in cache, doesn't find it > because it has been removed by 1, fetches the entry from S3, writes it to > existing file (using replace existing) > Thread 1: async removalListener is invoked, acquires a lock on old entry > (which has been removed from cache), it renames the file to "deleted" and > starts deleting it > Thread 2: Tries to create in-memory/mmapped index, but doesn't find the file > and hence, creates a new file of size 2GB in AbstractIndex constructor. JVM > returns an error as it won't allow creation of 2GB random access file. > *Potential Fix* > Use EvictionListener instead of RemovalListener in Caffeine cache as per the > documentation: > {quote} When the operation must be performed synchronously with eviction, use > {{Caffeine.evictionListener(RemovalListener)}} instead. This listener will > only be notified when {{RemovalCause.wasEvicted()}} is true. For an explicit > removal, {{Cache.asMap()}} offers compute methods that are performed > atomically.{quote} > This will ensure that removal from cache and marking the file with delete > suffix is synchronously done, hence the above race condition will not occur. -- This message was sent by Atlassian Jira (v8.20.10#820010)