yashmayya opened a new pull request, #18113:
URL: https://github.com/apache/pinot/pull/18113

   ## Summary
   - **Leader coordination**: Only the lead controller runs 
`ResponseStoreCleaner` by gating `processTables()` on 
`isLeaderForTable(TASK_NAME)`, preventing all controllers from racing to delete 
the same expired responses on each broker.
   - **Graceful concurrent deletion on broker**: 
`AbstractResponseStore.deleteResponse()` now catches exceptions from 
`readResponse()` when files vanish between the `exists()` check and the read 
(TOCTOU race). `FsResponseStore.deleteResponseImpl()` catches exceptions from 
`pinotFS.delete()` and treats already-gone directories as success instead of 
throwing.
   - **No batch abort on individual failures**: 
`ResponseStoreCleaner.deleteExpiredResponses()` logs individual DELETE failures 
as warnings instead of throwing a `RuntimeException`, so one failed DELETE no 
longer aborts the entire broker's cleanup batch.
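
The controller-side changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `CleanerSketch`, `DeleteCall`, and `cleanup()` are hypothetical names, the leadership check is reduced to a boolean, and the HTTP DELETE is abstracted behind an interface that returns a status code. Treating 404 as success is assumed from the `testCleanupTreats404AsSuccess` test name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and structure are assumptions, not Pinot's classes.
final class CleanerSketch {
  /** Stand-in for an HTTP DELETE against a broker; returns the status code. */
  interface DeleteCall {
    int delete(String requestId);
  }

  /** Returns the ids considered cleaned up; never throws for a single failed DELETE. */
  static List<String> cleanup(boolean isLeader, List<String> expiredIds, DeleteCall call) {
    List<String> cleaned = new ArrayList<>();
    if (!isLeader) {
      // Non-lead controllers skip the run, so they cannot race the leader.
      return cleaned;
    }
    for (String id : expiredIds) {
      int status;
      try {
        status = call.delete(id);
      } catch (RuntimeException e) {
        // Log and move on instead of aborting the rest of the batch.
        System.err.println("WARN: DELETE failed for " + id + ": " + e);
        continue;
      }
      if (status == 200 || status == 404) {
        // 404 means the response is already gone, which is the desired end state.
        cleaned.add(id);
      } else {
        System.err.println("WARN: DELETE for " + id + " returned HTTP " + status);
      }
    }
    return cleaned;
  }
}
```

With this shape, a 500 for one id only produces a warning, and the remaining ids in the same broker batch are still attempted.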
   
   ## Root cause
   When multiple controllers run the `ResponseStoreCleaner` concurrently (all 
controllers run it because `processTables()` ignores the table leadership 
list), they race to delete the same expired cursor responses on each broker. 
The broker's `deleteResponse()` has a TOCTOU race across the `exists()` → 
`readResponse()` → `deleteResponseImpl()` sequence: when one controller deletes 
a cursor's files, the others hit `FileNotFoundException` / `IOException`, and 
the broker returns HTTP 500 instead of 404. The controller's 
`deleteExpiredResponses()` then throws on any single 500, which aborts that 
broker's batch and skips the logging of the deletes that did succeed.
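
The broker-side race can be illustrated with a small, self-contained sketch built on `java.nio.file` rather than Pinot's `PinotFS`. `TolerantDeleteSketch` and its `deleteResponse(Path)` method are hypothetical stand-ins for the check, read, delete sequence described above. The point is only that a file vanishing after the existence check is reported as "not found" rather than surfacing as an error.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical sketch of a race-tolerant delete; not the PR's actual implementation.
final class TolerantDeleteSketch {
  /**
   * Returns true if this call removed the file, false if it was absent
   * or disappeared while the call was in progress (caller would map false to 404).
   */
  static boolean deleteResponse(Path responseFile) throws IOException {
    if (!Files.exists(responseFile)) {
      return false;
    }
    try {
      // Analogous to readResponse(): a concurrent delete may win between
      // the exists() check above and this read.
      Files.readAllBytes(responseFile);
    } catch (NoSuchFileException e) {
      return false; // already gone, treat as "not found" rather than a server error
    }
    // deleteIfExists() returns false instead of throwing when the file is already gone.
    return Files.deleteIfExists(responseFile);
  }
}
```

Calling `deleteResponse()` twice on the same path returns `true` for the first call and `false` for the second, which is the behavior a second controller's DELETE would observe.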
   
   ## Test plan
   - [x] Existing `ResponseStoreCleanerTest` tests pass (including 
`testPartialBrokerFailureDoesNotBlockOthers` and 
`testCleanupTreats404AsSuccess`)
   - [ ] Verify in a multi-controller environment that only the lead controller 
runs the cleaner
   - [ ] Verify that concurrent DELETE requests to the broker no longer cause 
HTTP 500s
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

