hanahmily opened a new pull request, #1152: URL: https://github.com/apache/skywalking-banyandb/pull/1152
## Problem The scheduled **Flaky Test** job ([run 26878691166](https://github.com/apache/skywalking-banyandb/actions/runs/26878691166/job/79277978234)) failed with two `banyand/measure` tests: - `TestSnapshotFunctionality` → `expected 1 directory for flushed part, got 0` (`snapshot_test.go:491`) - `TestMeasureInitTSTableDeletesMultipleFailedSnapshotsOnFallback` → `"0" is not greater than or equal to "1"` (`snapshot_test.go:658`) ## Root cause Both tests used `require.Eventually` waiting for a **part directory** to appear in `tab/` as the "flush is done" gate. But that directory is created by the **first line** of `memPart.mustFlush` (`MkdirPanicIfExist`, `part.go:208`) — *before* the mem→file introduction reaches the in-memory snapshot and *before* the `.snp` manifest is persisted (`introduceFlushed` → `persistSnapshot`). So the gate fires early, and under `-race`/CI load: - `TestSnapshotFunctionality`: `TakeFileSnapshot` skips mem parts and returns `(false, nil)` when there are no disk parts; the test ignored `success` and asserted on entries → **0**. - `...DeletesMultipleFailedSnapshotsOnFallback`: `Close()` cancels the loopCloser; the introducer shutdown drain uses a non-blocking `default`, so an in-flight flush whose `flushCh <- ind` hasn't been received is dropped → no `.snp` persisted → **0**. The two PANIC lines in the CI log are unrelated noise: `cannot parse ...snp: unexpected end of JSON input` is the deliberate `require.Panics` in `TestMustReadSnapshotPanicsOnError`; `directory is exist` is a panicdiag-recovered flusher panic. ## Fix Add `waitForPersistedSnapshot`, which gates on the persisted `.snp` manifest existing instead of the part directory, and apply it to the four tests sharing this pattern (`TestSnapshotFunctionality`, `TestTolerantLoaderFallbackToOlderSnapshot`, `TestMeasureInitTSTableDeletesMultipleFailedSnapshotsOnFallback`, `TestInitTSTableLoadsNewestWhenMultipleValid`). Also assert `success == true` from `TakeFileSnapshot`, and drop the now-redundant `time.Sleep(100ms)` calls. This is a test-only change; no production behavior is modified. ## Verification - Reproduced the failure deterministically (100%) by removing the masking sleep: the in-memory snapshot held `memParts=2, fileParts=0` while the part directory already existed on disk. - After the fix: **300/300** green under `-race` for the four tests (previously ~1/60 flaky). - Local CI: `make license-check / check-req / build / lint / check` all pass; `make test-ci PKG=./banyand/...` (ginkgo `--race`) — Test Suite Passed. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
