[
https://issues.apache.org/jira/browse/IGNITE-28685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dmitry Pavlov updated IGNITE-28685:
-----------------------------------
Labels: MakeTeamcityGreenAgain cpp test-flakiness thin-client (was: cpp
test-flakiness thin-client)
> C++ thin client test IgniteClientUserThreadPoolSize is flaky
> ------------------------------------------------------------
>
> Key: IGNITE-28685
> URL: https://issues.apache.org/jira/browse/IGNITE-28685
> Project: Ignite
> Issue Type: Bug
> Reporter: Ignite TC Bot
> Priority: Major
> Labels: MakeTeamcityGreenAgain, cpp, test-flakiness, thin-client
>
> h2. Problem
> TeamCity intermittently fails the C++ thin client test:
> * Suite: Platform C++ CMake (Linux)
> * Build: https://ci2.ignite.apache.org/viewLog.html?buildId=9060050
> * Test: IgniteThinClientTest: IgniteClientTestSuite:
> IgniteClientUserThreadPoolSize
> * Observed duration: about 14 ms
> The failure was seen on PR 13121, but that PR changes only notice/license
> files, so the failure is very unlikely to be caused by its changes.
> h2. Investigation
> The test is implemented in:
> * modules/platforms/cpp/thin-client-test/src/ignite_client_test.cpp
> * method: IgniteClientTestSuiteFixture::CheckThreadsNum
> * test case: IgniteClientUserThreadPoolSize
> The test starts a fake thin-client server on the fixed port 11110, then
> measures the total process thread count through /proc/<pid>/task before and
> immediately after IgniteClient::Start. It expects the exact delta to be
> userThreadPoolSize + 1 on Linux.
> This is brittle because:
> * the assertion uses the whole process thread count, not only the client/user
> pool threads;
> * startup/shutdown of native worker threads is asynchronous from the test's
> point of view;
> * port 11110 is reused by many other thin-client tests and configs in the
> same suite;
> * base-branch history shows this test is flaky with a high failure rate.
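>
> For illustration, the measurement described above can be sketched roughly as
> follows. This is a minimal, Linux-only sketch and not the actual
> GetThreadsCount implementation; the name CountProcessThreads is hypothetical.
> It counts entries in /proc/self/task, so every thread in the process is
> included, not only the client pool threads.
>

```cpp
#include <cassert>
#include <cstring>
#include <dirent.h> // opendir, readdir, closedir (POSIX)

// Hypothetical helper (assumption, not Ignite code): counts the entries in
// /proc/self/task, which holds one directory per thread of the calling process.
// Returns -1 if the directory cannot be opened (for example, on non-Linux).
int CountProcessThreads()
{
    DIR* dir = opendir("/proc/self/task");
    if (!dir)
        return -1;

    int count = 0;
    while (dirent* entry = readdir(dir))
    {
        // Skip the "." and ".." pseudo-entries; every other entry is a thread ID.
        if (std::strcmp(entry->d_name, ".") != 0 &&
            std::strcmp(entry->d_name, "..") != 0)
        {
            ++count;
        }
    }

    closedir(dir);
    return count;
}
```

> Because such a count is a whole-process snapshot, any unrelated thread that
> starts or exits between the two samples changes the delta, which is
> consistent with the brittleness listed above.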
> Relevant code paths:
> * modules/platforms/cpp/thin-client-test/src/ignite_client_test.cpp:
> CheckThreadsNum / IgniteClientUserThreadPoolSize
> * modules/platforms/cpp/thin-client-test/include/test_server.h: TestServer
> * modules/platforms/cpp/thin-client/src/impl/data_router.cpp:
> DataRouter::Connect / DataRouter::Close
> * modules/platforms/cpp/common/src/common/thread_pool.cpp: ThreadPool::Start
> / ThreadPool::Stop
> * modules/platforms/cpp/common/os/linux/src/common/concurrent_os.cpp:
> GetThreadsCount
> h2. Proposed fix
> Stabilize the test without changing production behavior:
> # Start TestServer on an ephemeral port by constructing it with port 0.
> # Expose TestServer::GetPort() and set IgniteClientConfiguration endpoints to
> the actual bound port.
> # Replace single immediate thread-count samples with bounded waits for the
> expected count after client start and after client destruction.
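>
> The ephemeral-port idea in the first two steps can be sketched with plain
> POSIX sockets. This is only an illustration under assumptions: the function
> ListenOnEphemeralPort is hypothetical and is not how TestServer is
> implemented. It binds to port 0 and reads the port chosen by the kernel back
> with getsockname.
>

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <cstdint>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper (assumption, not the real TestServer): listens on
// 127.0.0.1 with port 0 so the OS picks a free port, then reports that port.
// Returns the listening socket descriptor, or -1 on failure.
int ListenOnEphemeralPort(std::uint16_t& port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0); // 0 asks the OS for any free port

    socklen_t len = sizeof(addr);
    if (bind(fd, reinterpret_cast<sockaddr*>(&addr), len) != 0 ||
        listen(fd, 1) != 0 ||
        getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0)
    {
        close(fd);
        return -1;
    }

    port = ntohs(addr.sin_port); // the port actually bound
    return fd;
}
```

> A test would then build the client endpoint from the reported port instead of
> hard-coding 11110, and close the descriptor when done.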
> A local draft patch already applies this shape:
> * add uint16_t TestServer::GetPort() const in
> modules/platforms/cpp/thin-client-test/include/test_server.h;
> * in IgniteClientUserThreadPoolSize, use TestServer server(0), build the
> endpoint from server.GetPort(), and wait up to 5 seconds for the thread count
> to settle.
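>
> A bounded wait of this kind could look like the sketch below. The helper name
> WaitFor and the 10 ms polling interval are assumptions for illustration and
> are not taken from the draft patch.
>

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (assumption, not from the patch): polls `condition`
// until it returns true or `timeout` elapses. Returns the final outcome.
bool WaitFor(const std::function<bool()>& condition,
             std::chrono::milliseconds timeout)
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;

    for (;;)
    {
        if (condition())
            return true;

        if (std::chrono::steady_clock::now() >= deadline)
            return false;

        // Short sleep so the loop does not spin while threads start or stop.
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
```

> In the test, the condition would compare the current thread count with the
> expected value, for example
> WaitFor([&] { return GetThreadsCount() == expected; }, std::chrono::seconds(5)),
> where the exact GetThreadsCount signature and namespace are assumed here.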
> h2. Expected result
> The test should continue checking that the configured user thread pool size
> affects the number of client worker threads, while avoiding fixed-port
> collisions and scheduler-timing races.
> h2. Validation
> Run:
> {code}
> ctest -R IgniteThinClientTest --output-on-failure
> {code}
> or run the C++ thin-client test executable with:
> {code}
> ignite-thin-client-tests
> --run_test=IgniteClientTestSuite/IgniteClientUserThreadPoolSize
> --catch_system_errors=no --log_level=all
> {code}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)