This is an automated email from the ASF dual-hosted git repository.

hanahmily pushed a commit to branch v0.10.x
in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git
commit 935db739e787cb36a26bc8f21e720035f4bcc996
Author: Gao Hongtao <[email protected]>
AuthorDate: Wed Apr 29 08:57:07 2026 +0800

fix: stabilize flaky TestCollectWithPartialClosedSegments and fit flaky-test in 50min (#1100)

* fix(measure,stream,trace): wait until all mem parts flushed in file_snapshot tests

The file_snapshot wait loop only checked snapshot.creator, which races with concurrent introductions: when flushTimeout=0 the flusher fires per-part, so a flush of part N can flip creator to Flusher while part N+1 is still a mem part. Closing then drops the still-in-memory part, losing data on reopen.

Replace the creator check with a scan over snapshot.parts for any partWrapper.mp != nil. Applied in measure/query_test.go (TestQueryResult and TestQueryResult_QuotaExceeded), stream/block_scanner_test.go, stream/query_by_idx_test.go, and trace/query_test.go.

* test(measure,stream,trace): use require.Eventually for file_snapshot wait loop

Replace the open-ended polling loop with require.Eventually, gated by a 30s deadline, so a stuck flusher (regression, deadlock, IO issue) fails the test at the right line with a clear message instead of relying on the global go-test timeout.

Extract the mem-part scan into per-package allPartsFlushed helpers so the predicate is shared between TestQueryResult and the QuotaExceeded variants in measure and stream. The trace package gets its own copy because partWrapper.mp is package-private.

* chore(ci): set flaky-test --repeat to 0 to fit 50-min timeout

The hourly flaky-test job consistently hit the 50-min cap because the integration suite needed ~25 min per iteration and --repeat 3 made it loop 4 times. Drop --repeat to 0 so one iteration runs and completes inside the cap and the 1-hour cron interval.

* fix(storage): stabilize TestCollectWithPartialClosedSegments

Raise SegmentIdleTimeout from 100ms to 1h so wall-clock variance on slow CI runners no longer flips still-open segments past the idle threshold.
The manual lastAccessed override is bumped to 2h ago so the segments the test marks idle stay past the new threshold.
---
 .github/workflows/flaky-test.yml      | 2 +-
 CHANGES.md                            | 2 ++
 banyand/internal/storage/tsdb_test.go | 4 ++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/flaky-test.yml b/.github/workflows/flaky-test.yml
index f5e0b57c4..d2a8a8b9f 100644
--- a/.github/workflows/flaky-test.yml
+++ b/.github/workflows/flaky-test.yml
@@ -32,5 +32,5 @@ jobs:
     uses: ./.github/workflows/test.yml
     with:
       test-name: Flaky Tests
-      options: --vv --repeat 3 --label-filter '(integration&&!slow)||banyand'
+      options: --vv --repeat 0 --label-filter '(integration&&!slow)||banyand'
       timeout-minutes: 50
diff --git a/CHANGES.md b/CHANGES.md
index eac054fab..d5b5ab50d 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -28,11 +28,13 @@ Release Notes.
 - Fix lifecycle migration panic when a stream shard's snapshot has no element index (`idx/`) directory.
 - Avoid FODC lifecycle inspection failing on busy data nodes by raising the per-broadcast `CollectDataInfo` / `CollectLiaisonInfo` deadline from 5s to 30s and parallelizing per-group inspection in the cluster-internal `InspectAll`.
 - Fix flaky `file_snapshot` subtest in measure/stream/trace by waiting until every introduced mem part has been flushed to disk, instead of only checking the latest snapshot creator.
+- Fix flaky `TestCollectWithPartialClosedSegments` by raising `SegmentIdleTimeout` so wall-clock variance on slow CI does not mark still-open segments as idle.
 
 ### Chores
 
 - Upgrade Go and npm dependencies including etcd to v3.6.10, OpenTelemetry to v1.43.0, AWS SDK, and Google Cloud libraries.
 - Regenerate expired TLS test certificate with 100-year validity.
+- Set Ginkgo `--repeat` to 0 in the flaky-test workflow so the hourly run completes within the 50-minute timeout.
 
 ## 0.10.0
diff --git a/banyand/internal/storage/tsdb_test.go b/banyand/internal/storage/tsdb_test.go
index f07263741..6b786a9c5 100644
--- a/banyand/internal/storage/tsdb_test.go
+++ b/banyand/internal/storage/tsdb_test.go
@@ -575,7 +575,7 @@ func TestCollectWithPartialClosedSegments(t *testing.T) {
 		TTL: IntervalRule{Unit: DAY, Num: 7},
 		ShardNum: 2,
 		TSTableCreator: MockTSTableCreator,
-		SegmentIdleTimeout: 100 * time.Millisecond, // Short idle timeout for testing
+		SegmentIdleTimeout: time.Hour, // Long enough that only segments with manually backdated lastAccessed are idle
 	}
 
 	ctx := context.Background()
@@ -627,7 +627,7 @@ func TestCollectWithPartialClosedSegments(t *testing.T) {
 	for _, s := range ss {
 		// Find segments we want to mark as idle (first and third)
 		if s.Start.Equal(segmentDates[0]) || s.Start.Equal(segmentDates[2]) {
-			s.lastAccessed.Store(time.Now().Add(-time.Hour).UnixNano())
+			s.lastAccessed.Store(time.Now().Add(-2 * time.Hour).UnixNano())
 			s.DecRef() // Force close
 		}
 		s.DecRef() // Release our reference from segments()
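The CI bullet's timing argument is plain arithmetic: Ginkgo's `--repeat N` runs the suite N additional times, so N+1 runs in total. A small Go sketch of the estimate, using the ~25 min per iteration and the 50-minute `timeout-minutes` quoted in the commit; `ginkgoRuns` and `estimatedMinutes` are hypothetical helper names.

```go
package main

import "fmt"

// ginkgoRuns returns how many times a suite executes for a given --repeat
// value: one initial run plus N repeats.
func ginkgoRuns(repeat int) int {
	return repeat + 1
}

// estimatedMinutes multiplies the number of runs by the per-run cost.
func estimatedMinutes(repeat, minutesPerRun int) int {
	return ginkgoRuns(repeat) * minutesPerRun
}

func main() {
	const perRun = 25 // approximate integration-suite duration cited in the commit
	const budget = 50 // timeout-minutes in flaky-test.yml
	for _, repeat := range []int{3, 0} {
		total := estimatedMinutes(repeat, perRun)
		fmt.Printf("--repeat %d: %d run(s), ~%d min, fits=%v\n", repeat, ginkgoRuns(repeat), total, total <= budget)
	}
}
```

Running it prints `--repeat 3: 4 run(s), ~100 min, fits=false` followed by `--repeat 0: 1 run(s), ~25 min, fits=true`.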
