(skywalking-banyandb) 07/14: fix: stabilize flaky TestCollectWithPartialClosedSegments and fit flaky-test in 50min (#1100)

hanahmily Fri, 08 May 2026 01:39:23 -0700

This is an automated email from the ASF dual-hosted git repository.

hanahmily pushed a commit to branch v0.10.x
in repository https://gitbox.apache.org/repos/asf/skywalking-banyandb.git


commit 935db739e787cb36a26bc8f21e720035f4bcc996
Author: Gao Hongtao <[email protected]>
AuthorDate: Wed Apr 29 08:57:07 2026 +0800

    fix: stabilize flaky TestCollectWithPartialClosedSegments and fit flaky-test in 50min (#1100)
    
    * fix(measure,stream,trace): wait until all mem parts flushed in file_snapshot tests
    
    The file_snapshot wait loop only checked snapshot.creator, which races
    with concurrent introductions: when flushTimeout=0 the flusher fires
    per-part, so a flush of part N can flip creator to Flusher while part
    N+1 is still a mem part. Closing then drops the still-in-memory part,
    losing data on reopen.
    
    Replace the creator check with a scan over snapshot.parts for any
    partWrapper.mp != nil. Applied in measure/query_test.go (TestQueryResult
    and TestQueryResult_QuotaExceeded), stream/block_scanner_test.go,
    stream/query_by_idx_test.go, and trace/query_test.go.
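
A minimal sketch of the mem-part scan described above, not the code from the repository. The field name `mp` comes from the commit message; the `memPart` and `snapshot` types, the shape of `partWrapper`, and the choice to treat a nil snapshot as "not flushed" are assumptions made for illustration.

```go
package main

import "fmt"

// memPart is a hypothetical stand-in for an in-memory part.
type memPart struct{}

// partWrapper is simplified: only the mp field named in the commit message is modeled.
type partWrapper struct {
	mp *memPart // non-nil while the part exists only in memory
}

// snapshot is a simplified view of the parts visible at one point in time.
type snapshot struct {
	parts []*partWrapper
}

// allPartsFlushed reports whether no part in the snapshot is still a mem part.
// Treating a nil snapshot as "not flushed" is an assumption of this sketch.
func allPartsFlushed(s *snapshot) bool {
	if s == nil {
		return false
	}
	for _, pw := range s.parts {
		if pw.mp != nil {
			return false
		}
	}
	return true
}

func main() {
	// Part N is already on disk, part N+1 is still in memory: the racy state.
	racing := &snapshot{parts: []*partWrapper{{mp: nil}, {mp: &memPart{}}}}
	flushed := &snapshot{parts: []*partWrapper{{}, {}}}
	fmt.Println(allPartsFlushed(racing))
	fmt.Println(allPartsFlushed(flushed))
}
```

Scanning every part rather than a single snapshot-level field means the predicate cannot be satisfied while any part is still in memory, which is the state the creator check let through.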
    
    * test(measure,stream,trace): use require.Eventually for file_snapshot wait loop
    
    Replace the open-ended polling loop with require.Eventually, gated by a
    30s deadline, so a stuck flusher (regression, deadlock, IO issue) fails
    the test at the right line with a clear message instead of relying on
    the global go-test timeout.
    
    Extract the mem-part scan into per-package allPartsFlushed helpers so
    the predicate is shared between TestQueryResult and the QuotaExceeded
    variants in measure and stream. The trace package gets its own copy
    because partWrapper.mp is package-private.
    
    * chore(ci): set flaky-test --repeat to 0 to fit 50-min timeout
    
    The hourly flaky-test job consistently hit the 50-min cap because
    the integration suite needed ~25 min per iteration and --repeat 3
    made it loop 4 times. Drop --repeat to 0 so one iteration runs and
    completes inside the cap and the 1-hour cron interval.
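
The arithmetic behind this change, as a small sketch. The `totalMinutes` helper is hypothetical; the figures (about 25 minutes per iteration, `--repeat 3` giving 4 runs, a 50-minute cap) are taken from the message above.

```go
package main

import "fmt"

// totalMinutes estimates the job's wall time: --repeat N reruns the suite
// N additional times, so the suite executes N+1 times in total.
func totalMinutes(repeat, perIterationMin int) int {
	return (repeat + 1) * perIterationMin
}

func main() {
	const capMin = 50
	for _, repeat := range []int{3, 0} {
		total := totalMinutes(repeat, 25)
		fmt.Printf("--repeat %d: ~%d min, fits cap: %v\n", repeat, total, total <= capMin)
	}
}
```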
    
    * fix(storage): stabilize TestCollectWithPartialClosedSegments
    
    Raise SegmentIdleTimeout from 100ms to 1h so wall-clock variance on slow
    CI runners no longer flips still-open segments past the idle threshold.
    The manual lastAccessed override is bumped to 2h ago so the segments the
    test marks idle stay past the new threshold.
---
 .github/workflows/flaky-test.yml      | 2 +-
 CHANGES.md                            | 2 ++
 banyand/internal/storage/tsdb_test.go | 4 ++--
 3 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/.github/workflows/flaky-test.yml b/.github/workflows/flaky-test.yml
index f5e0b57c4..d2a8a8b9f 100644
--- a/.github/workflows/flaky-test.yml
+++ b/.github/workflows/flaky-test.yml
@@ -32,5 +32,5 @@ jobs:
     uses: ./.github/workflows/test.yml
     with:
       test-name: Flaky Tests
-      options:  --vv --repeat 3 --label-filter '(integration&&!slow)||banyand'
+      options:  --vv --repeat 0 --label-filter '(integration&&!slow)||banyand'
       timeout-minutes: 50
diff --git a/CHANGES.md b/CHANGES.md
index eac054fab..d5b5ab50d 100644
--- a/CHANGES.md
+++ b/CHANGES.md
@@ -28,11 +28,13 @@ Release Notes.
 - Fix lifecycle migration panic when a stream shard's snapshot has no element index (`idx/`) directory.
 - Avoid FODC lifecycle inspection failing on busy data nodes by raising the per-broadcast `CollectDataInfo` / `CollectLiaisonInfo` deadline from 5s to 30s and parallelizing per-group inspection in the cluster-internal `InspectAll`.
 - Fix flaky `file_snapshot` subtest in measure/stream/trace by waiting until every introduced mem part has been flushed to disk, instead of only checking the latest snapshot creator.
+- Fix flaky `TestCollectWithPartialClosedSegments` by raising `SegmentIdleTimeout` so wall-clock variance on slow CI does not mark still-open segments as idle.
 
 ### Chores
 
 - Upgrade Go and npm dependencies including etcd to v3.6.10, OpenTelemetry to v1.43.0, AWS SDK, and Google Cloud libraries.
 - Regenerate expired TLS test certificate with 100-year validity.
+- Set Ginkgo `--repeat` to 0 in the flaky-test workflow so the hourly run completes within the 50-minute timeout.
 
 ## 0.10.0
 
diff --git a/banyand/internal/storage/tsdb_test.go b/banyand/internal/storage/tsdb_test.go
index f07263741..6b786a9c5 100644
--- a/banyand/internal/storage/tsdb_test.go
+++ b/banyand/internal/storage/tsdb_test.go
@@ -575,7 +575,7 @@ func TestCollectWithPartialClosedSegments(t *testing.T) {
                TTL:                IntervalRule{Unit: DAY, Num: 7},
                ShardNum:           2,
                TSTableCreator:     MockTSTableCreator,
-               SegmentIdleTimeout: 100 * time.Millisecond, // Short idle timeout for testing
+               SegmentIdleTimeout: time.Hour, // Long enough that only segments with manually backdated lastAccessed are idle
        }
 
        ctx := context.Background()
@@ -627,7 +627,7 @@ func TestCollectWithPartialClosedSegments(t *testing.T) {
        for _, s := range ss {
                // Find segments we want to mark as idle (first and third)
                 if s.Start.Equal(segmentDates[0]) || s.Start.Equal(segmentDates[2]) {
-                       s.lastAccessed.Store(time.Now().Add(-time.Hour).UnixNano())
+                       s.lastAccessed.Store(time.Now().Add(-2 * time.Hour).UnixNano())
                        s.DecRef() // Force close
                }
                s.DecRef() // Release our reference from segments()
