kfaraz commented on code in PR #14616:
URL: https://github.com/apache/druid/pull/14616#discussion_r1273012308
##########
server/src/main/java/org/apache/druid/server/coordination/SegmentLoadDropHandler.java:
##########
@@ -90,19 +93,46 @@ public class SegmentLoadDropHandler implements DataSegmentChangeHandler
   private final SegmentManager segmentManager;
   private final ScheduledExecutorService exec;
   private final ServerTypeConfig serverTypeConfig;
-  private final ConcurrentSkipListSet<DataSegment> segmentsToDelete;
+  private final ConcurrentSkipListSet<DataSegment> segmentsToDrop;
   private final SegmentCacheManager segmentCacheManager;

   private volatile boolean started = false;

-  // Keep history of load/drop request status in a LRU cache to maintain idempotency if same request shows up
-  // again and to return status of a completed request. Maximum size of this cache must be significantly greater
-  // than number of pending load/drop requests. so that history is not lost too quickly.
-  private final Cache<DataSegmentChangeRequest, AtomicReference<Status>> requestStatuses;
+  /**
+   * Used to cache the status of a completed load or drop request until it has
+   * been served to the (Coordinator) client exactly once.
+   * <p>
+   * The cache is used as follows:
+   * <ol>
+   * <li>An entry with state PENDING is added to the cache upon receiving a
+   * request to load or drop a segment.</li>
+   * <li>A duplicate request received at this point is immediately answered with PENDING.</li>
+   * <li>Once the load/drop finishes, the entry is updated to either SUCCESS or FAILED.</li>
+   * <li>A duplicate request received at this point is immediately answered with
+   * SUCCESS or FAILED and the entry is removed from the cache.</li>
+   * <li>If the first request itself finishes after the load or drop has already
+   * completed, it is answered with a SUCCESS or FAILED and the entry is removed
+   * from the cache.</li>
+   * </ol>
+   * <p>
+   * Maximum size of this cache must be significantly greater than the number of

Review Comment:
   If the Coordinator sends a batch larger than the cache size, the Historical would not be able to keep an entry in the cache for every request. Those requests would still be processed, but their finished statuses would never be recorded in the cache and thus never sent back to the Coordinator. The Coordinator would end up retrying them, and the Historical might duplicate some work, but there would never be a correctness issue, since we never send a response to the Coordinator for the un-cached requests.

   Given that the default cache size is 100 and the default request batch size is 1, we should be okay. We can update the docs to mention that the cache size must be at least 10x the batch size.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
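To make the lifecycle in the javadoc concrete, here is a minimal, self-contained sketch of a served-exactly-once status cache. This is not Druid's actual implementation (the real field in the diff is a `Cache<DataSegmentChangeRequest, AtomicReference<Status>>` keyed by the change request); the class `StatusCache`, its methods `submit`/`complete`, and the use of a `LinkedHashMap`-based LRU are all hypothetical simplifications for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the request-status cache lifecycle described in the
// javadoc above. Not Druid's actual code; names are illustrative only.
public class StatusCache
{
  enum State { PENDING, SUCCESS, FAILED }

  private final Map<String, State> statuses;

  StatusCache(final int maxSize)
  {
    // LRU eviction: the eldest entry is dropped once the cache exceeds maxSize.
    // This is why the cache must be much larger than a request batch: if a
    // batch overflows the cache, some finished statuses are evicted before
    // they can ever be served back to the Coordinator.
    this.statuses = new LinkedHashMap<String, State>(16, 0.75f, true)
    {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, State> eldest)
      {
        return size() > maxSize;
      }
    };
  }

  // Step 1: a new load/drop request is recorded as PENDING.
  // Steps 2, 4, 5: a duplicate request is answered with the current state;
  // a terminal state (SUCCESS/FAILED) is served exactly once, then removed.
  synchronized State submit(String segmentId)
  {
    State existing = statuses.get(segmentId);
    if (existing == null) {
      statuses.put(segmentId, State.PENDING);
      return State.PENDING;
    }
    if (existing != State.PENDING) {
      statuses.remove(segmentId);
    }
    return existing;
  }

  // Step 3: the load/drop finished; record the terminal state until it is served.
  synchronized void complete(String segmentId, State terminal)
  {
    statuses.put(segmentId, terminal);
  }
}
```

Note how a terminal status is removed on the first read: once the Coordinator has seen SUCCESS or FAILED for a request, the entry's job is done, and keeping it would only crowd out entries for newer pending requests.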
