hanahmily opened a new pull request, #1152:
URL: https://github.com/apache/skywalking-banyandb/pull/1152

   ## Problem
   
   The scheduled **Flaky Test** job ([run 
26878691166](https://github.com/apache/skywalking-banyandb/actions/runs/26878691166/job/79277978234))
 failed with two `banyand/measure` tests:
   
   - `TestSnapshotFunctionality` → `expected 1 directory for flushed part, got 
0` (`snapshot_test.go:491`)
   - `TestMeasureInitTSTableDeletesMultipleFailedSnapshotsOnFallback` → `"0" is 
not greater than or equal to "1"` (`snapshot_test.go:658`)
   
   ## Root cause
   
   Both tests used `require.Eventually` waiting for a **part directory** to 
appear in `tab/` as the "flush is done" gate. But that directory is created by 
the **first line** of `memPart.mustFlush` (`MkdirPanicIfExist`, `part.go:208`) 
— *before* the mem→file introduction reaches the in-memory snapshot and 
*before* the `.snp` manifest is persisted (`introduceFlushed` → 
`persistSnapshot`). So the gate fires early, and under `-race`/CI load:
   
   - `TestSnapshotFunctionality`: `TakeFileSnapshot` skips mem parts and 
returns `(false, nil)` when there are no disk parts; the test ignored `success` 
and asserted on entries → **0**.
   - `...DeletesMultipleFailedSnapshotsOnFallback`: `Close()` cancels the 
loopCloser; the introducer shutdown drain uses a non-blocking `default`, so an 
in-flight flush whose `flushCh <- ind` hasn't been received is dropped → no 
`.snp` persisted → **0**.
   
   The two PANIC lines in the CI log are unrelated noise: `cannot parse ...snp: 
unexpected end of JSON input` is the deliberate `require.Panics` in 
`TestMustReadSnapshotPanicsOnError`; `directory is exist` is a 
panicdiag-recovered flusher panic.
   
   ## Fix
   
   Add `waitForPersistedSnapshot`, which gates on the persisted `.snp` manifest 
existing instead of the part directory, and apply it to the four tests sharing 
this pattern (`TestSnapshotFunctionality`, 
`TestTolerantLoaderFallbackToOlderSnapshot`, 
`TestMeasureInitTSTableDeletesMultipleFailedSnapshotsOnFallback`, 
`TestInitTSTableLoadsNewestWhenMultipleValid`). Also assert `success == true` 
from `TakeFileSnapshot`, and drop the now-redundant `time.Sleep(100ms)` calls.
   
   This is a test-only change; no production behavior is modified.
   
   ## Verification
   
   - Reproduced the failure deterministically (100%) by removing the masking 
sleep: the in-memory snapshot held `memParts=2, fileParts=0` while the part 
directory already existed on disk.
   - After the fix: **300/300** green under `-race` for the four tests 
(previously ~1/60 flaky).
   - Local CI: `make license-check / check-req / build / lint / check` all 
pass; `make test-ci PKG=./banyand/...` (ginkgo `--race`) — Test Suite Passed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to