Re: [PR] [improvement](fe) Add configurable interval for catalog recycle bin daemon thread [doris]

via GitHub Wed, 03 Jun 2026 02:21:39 -0700


morningman commented on PR #63311:
URL: https://github.com/apache/doris/pull/63311#issuecomment-4610856871


   ## 1. Make the test fail on the unfixed code
   
   This is the most important point. The new `testEraseUsesFreshCurrentTime` 
**passes against both the buggy and the fixed code**, so it does not actually 
verify this PR.
   
   Why: the test calls `Thread.sleep(1500)` *before* invoking 
`runAfterCatalogReady()`. The recycled table and database get `recycleTime ≈ 
now` at setup, and `catalog_trash_expire_second` is set to 1s. By the time 
`runAfterCatalogReady()` runs and captures its timestamp — even the *stale* one 
captured at method entry in the old code — roughly 1500ms have already elapsed:
   
   ```
   latency = currentTimeMs(stale, at method entry) - recycleTime ≈ 1500ms > 
expire(1000ms)  → already expired
   ```
   
   So the old shared-timestamp code also sees the table/database as expired and 
erases them; every assertion still passes. This bug only manifests when an item 
crosses the expiry threshold *during* the slow `erasePartition` call — a 
condition this test never creates, because it crosses the threshold *before* 
the daemon method even starts.
   
   > Verification the author can run: revert the production change back to the 
shared `currentTimeMs` and run only this test — it still passes. A test that 
cannot fail when the business logic is broken does not prove the fix works.
   
   **Recommended fix — abstract the time source so the test is deterministic** 
(no wall-clock dependence, no need for 1000 partitions):
   
   ```java
   // Production side: make the clock injectable (illustrative)
   private LongSupplier clock = System::currentTimeMillis;
   
   @VisibleForTesting
   void setClock(LongSupplier clock) { this.clock = clock; }
   
   protected void runAfterCatalogReady() {
       int keepNum = Config.max_same_name_catalog_trash_num;
       erasePartition(clock.getAsLong(), keepNum);
       eraseTable(clock.getAsLong(), keepNum);
       eraseDatabase(clock.getAsLong(), keepNum);
   }
   ```
   
   In the test, drive the clock so the second/third reads return a value 
noticeably larger than the first, and set the table/database `recycleTime` so 
they are **not expired at the first read but expired at the later reads**. That 
deterministically distinguishes fixed from broken behavior.
   
   If changing the production signature is undesirable, the alternative is to 
construct the real "threshold crossed *during* `erasePartition`" condition 
(remove the pre-sleep, and make `erasePartition` genuinely take longer than the 
expiry window). This path is inherently flaky (see #2) and is not recommended.
   
   ## 2. [Should fix] Wall-clock–based timing makes the test flaky even once 
the logic is corrected
   
   The test assumes "erasing 1000 partitions is slow enough to cross the expiry 
threshold." In a unit test, erasing 1000 **in-memory** partitions will very 
likely complete well under the 1-second threshold, so the "slow operation" 
premise may not hold — making the test both ineffective and dependent on 
machine/load. Timing bugs tested against a real wall clock are fragile by 
nature. Prefer abstracting the time source (a controllable fake clock, as in 
#1) so the outcome is deterministic.
   
   ## 3. [Should fix] Test pollution — restore 
`Config.max_same_name_catalog_trash_num`
   
   The test sets `Config.max_same_name_catalog_trash_num = -1` *outside* the 
`try` block and never restores it. The default is `3`, and it is a static 
global, so leaving it at `-1` can cause order-dependent failures in other tests 
sharing the JVM. The `finally` already restores the other two configs — add 
this one too:
   
   ```java
   int origMaxSameName = Config.max_same_name_catalog_trash_num;
   Config.max_same_name_catalog_trash_num = -1;
   ...
   } finally {
       Config.max_same_name_catalog_trash_num = origMaxSameName;
       Config.catalog_trash_expire_second = origExpireSecond;
       Config.catalog_trash_ignore_min_erase_latency = origIgnoreMinErase;
   }
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [improvement](fe) Add configurable interval for catalog recycle bin daemon thread [doris]

Reply via email to