wombatu-kun opened a new pull request, #16438:
URL: https://github.com/apache/iceberg/pull/16438

   ## Summary
   
   `kafka-connect-tests` is intermittently failing on the assertion in 
`IntegrationTestBase.assertSnapshotAdded()` (file 
`kafka-connect/kafka-connect-runtime/src/integration/java/org/apache/iceberg/connect/IntegrationTestBase.java`,
 currently line 237). The failure looks like a bare `java.lang.AssertionError 
at IntegrationTestBase.java:237` with no message.
   
   Root cause: the assertion is inside an 
`Awaitility.await().atMost(Duration.ofSeconds(30)).pollInterval(Duration.ofSeconds(1)).untilAsserted(...)`
 block that waits for the kafka-connect sink to commit the first snapshot. 
Under load on GitHub Actions runners the kafka-connect testcontainer stack 
(Kafka + Connect + Iceberg REST catalog + MinIO) plus connector deployment plus 
the first commit-interval cycle can exceed the 30-second budget, and the bare 
assertion error makes the failure hard to diagnose.
   
   Evidence the failure is timing-related and not a real bug:
   
   - Same assertion fails on **different parameterizations** of 
`TestIntegrationDynamicTable.testIcebergSink` between the two JDK matrix legs 
of the same CI run (JDK 17 fails `[1] null`, JDK 21 fails `[2] test_branch`) — 
alternating-side failure is the signature of timing flakiness, not a 
deterministic bug on one code path.
   - `@AfterEach` (line 105) drops tables and namespaces between parameterized 
runs, so the assertion failing on `hasSize(1)` has to mean `table.snapshots()` 
was empty at the 30s mark — i.e. the sink had not yet committed — rather than 
residual snapshots leaking across runs.
   - `iceberg.control.commit.interval-ms = 1000` (line 202): on a healthy run 
the connector commits within ~1s of becoming active, so the assertion typically 
succeeds well under the original 30s budget. The extra headroom in this PR only 
kicks in on contended runners.
   
   This PR makes two changes inside `IntegrationTestBase.runTest()` / 
`assertSnapshotAdded()`:
   
   - Raise the Awaitility `.atMost(...)` from 30s to 60s. No effect on healthy 
runs.
   - Add an AssertJ `.as("expected 1 snapshot on table %s after sink commit", 
tableId)` description so that the next timeout produces a labeled failure 
containing the table identifier instead of an unlabeled `AssertionError`.
   
   No production code is changed; only the integration test base class.
   
   ## Test plan
   
   - [ ] CI: `kafka-connect-tests (17)` passes
   - [ ] CI: `kafka-connect-tests (21)` passes
   - [ ] CI run re-runs cleanly across multiple attempts (no reintroduction of 
the flake under load)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to