kfaraz commented on code in PR #17824:
URL: https://github.com/apache/druid/pull/17824#discussion_r2017556232


##########
server/src/main/java/org/apache/druid/metadata/segment/cache/HeapMemorySegmentMetadataCache.java:
##########
@@ -71,10 +71,20 @@
 /**
  * In-memory implementation of {@link SegmentMetadataCache}.
  * <p>
 - * Reads can be done from the cache only after it has synced with the metadata store.
- * Writes can happen even when the sync is in progress. This is because the
- * {@link #syncWithMetadataStore()} is able to reconcile differences based on
- * the update time of a transaction.
 + * Only used segments (excluding num_rows and schema_fingerprint) and
 + * pending segments are cached. Unused segments are not cached.
+ * <p>
+ * Non-leader Overlords also keep polling the metadata store to keep the cache
+ * up-to-date in case leadership changes.
+ * <p>
 + * Cache usage modes are defined by {@link UsageMode}.
+ * <p>
 + * The map {@link #datasourceToSegmentCache} contains the cache for each datasource.

Review Comment:
   Yeah, I had kind of left it for a later PR. But for the sake of completeness 
of this feature, we should probably address it sooner rather than later.
   
   I don't think the `ConcurrentMap` is really the problem. In fact, it keeps the
   design fairly clean and highly concurrent. The problem is that we may have handed
   out a datasource cache instance that is yet to be locked by a transaction for
   read or write, so the cleanup thread doesn't know whether a transaction is just
   about to start working on it.
   
   I had a slightly different approach in mind to solve this problem. It would
   work using something like permissions (i.e. a reference count per datasource cache).
   
   When a transaction makes a call to `getCacheForDatasource()`, it gets the 
cache instance for that datasource. At this point, the permission count of this 
cache will be incremented.
   Then, at some point soon after, this transaction acquires an actual read or 
write lock on this cache instance and does its thing.
   Finally, the transaction releases the read or write lock which also 
decrements the permissions count.
   
   The cache for a datasource can be completely deleted only when its 
permission count is 0.
   
   This ensures that if a transaction has gotten an instance of a datasource 
cache, it cannot be cleaned up even if the cache is not currently locked.
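   A minimal sketch of what I mean, in case it helps. All the names here
   (`CacheRegistry`, `acquire()`, `release()`, `isDeletable()`) are hypothetical
   stand-ins, not the actual classes in this PR, and `CacheRegistry` just stands
   in for the `datasourceToSegmentCache` map:

   ```java
   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.atomic.AtomicInteger;

   // Hypothetical per-datasource cache entry carrying a permission count.
   class DatasourceCache
   {
     private final AtomicInteger permits = new AtomicInteger(0);

     // Called when the instance is handed out to a transaction.
     void acquire()
     {
       permits.incrementAndGet();
     }

     // Called when the transaction releases its read/write lock.
     void release()
     {
       permits.decrementAndGet();
     }

     boolean isDeletable()
     {
       return permits.get() == 0;
     }
   }

   // Hypothetical registry standing in for the datasourceToSegmentCache map.
   class CacheRegistry
   {
     private final ConcurrentHashMap<String, DatasourceCache> caches = new ConcurrentHashMap<>();

     DatasourceCache getCacheForDatasource(String datasource)
     {
       // Increment inside compute() so that handing out the instance and
       // bumping the permission count are atomic with respect to cleanup.
       return caches.compute(datasource, (ds, cache) -> {
         DatasourceCache result = (cache == null) ? new DatasourceCache() : cache;
         result.acquire();
         return result;
       });
     }

     // Cleanup thread: a datasource cache may be deleted only when its
     // permission count is 0, i.e. no transaction currently holds it.
     void cleanup()
     {
       for (String ds : caches.keySet()) {
         // compute() removes the entry atomically only if still deletable.
         caches.compute(ds, (key, cache) -> (cache != null && cache.isDeletable()) ? null : cache);
       }
     }
   }
   ```

   Doing the increment inside `compute()` closes the window where the cleanup
   thread could delete a cache between it being handed out and the transaction
   actually taking the lock. The decrement would naturally live in the same
   place where the read/write lock is released, so a transaction cannot forget it.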
   
   Please let me know what you think. I also feel that maybe this fix should go 
in a separate PR.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

