[ 
https://issues.apache.org/jira/browse/IGNITE-28685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Pavlov updated IGNITE-28685:
-----------------------------------
    Labels: MakeTeamcityGreenAgain cpp test-flakiness thin-client  (was: cpp test-flakiness thin-client)

> C++ thin client test IgniteClientUserThreadPoolSize is flaky
> ------------------------------------------------------------
>
>                 Key: IGNITE-28685
>                 URL: https://issues.apache.org/jira/browse/IGNITE-28685
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ignite TC Bot
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain, cpp, test-flakiness, thin-client
>
> h2. Problem
> TeamCity intermittently fails the C++ thin client test:
> * Suite: Platform C++ CMake (Linux)
> * Build: https://ci2.ignite.apache.org/viewLog.html?buildId=9060050
> * Test: IgniteThinClientTest: IgniteClientTestSuite: IgniteClientUserThreadPoolSize
> * Observed duration: about 14 ms
> The failure was seen on PR 13121, but that PR changes only notice/license 
> files, so its changes are very unlikely to be the cause.
> h2. Investigation
> The test is implemented in:
> * modules/platforms/cpp/thin-client-test/src/ignite_client_test.cpp
> * method: IgniteClientTestSuiteFixture::CheckThreadsNum
> * test case: IgniteClientUserThreadPoolSize
> The test starts a fake thin-client server on the fixed port 11110, then 
> measures the total process thread count through /proc/<pid>/task before and 
> immediately after IgniteClient::Start. It expects the exact delta to be 
> userThreadPoolSize + 1 on Linux.
> This is brittle because:
> * the assertion uses the whole process thread count, not only the client/user 
> pool threads;
> * startup and shutdown of native worker threads are asynchronous from the 
> test's point of view;
> * port 11110 is reused by many other thin-client tests and configs in the 
> same suite;
> * base-branch history shows this test is flaky with a high failure rate.
> Relevant code paths:
> * modules/platforms/cpp/thin-client-test/src/ignite_client_test.cpp: 
> CheckThreadsNum / IgniteClientUserThreadPoolSize
> * modules/platforms/cpp/thin-client-test/include/test_server.h: TestServer
> * modules/platforms/cpp/thin-client/src/impl/data_router.cpp: 
> DataRouter::Connect / DataRouter::Close
> * modules/platforms/cpp/common/src/common/thread_pool.cpp: ThreadPool::Start 
> / ThreadPool::Stop
> * modules/platforms/cpp/common/os/linux/src/common/concurrent_os.cpp: 
> GetThreadsCount
> h2. Proposed fix
> Stabilize the test without changing production behavior:
> # Start TestServer on an ephemeral port by constructing it with port 0.
> # Expose TestServer::GetPort() and set IgniteClientConfiguration endpoints to 
> the actual bound port.
> # Replace single immediate thread-count samples with bounded waits for the 
> expected count after client start and after client destruction.
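The first two steps can be sketched with plain POSIX sockets. EphemeralListener and GetEndpoint are hypothetical names used for illustration only; the real TestServer may be built on a different networking layer, so treat this as a sketch of the port-0 technique rather than the actual patch.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a test server; not the Ignite TestServer class.
class EphemeralListener
{
public:
    EphemeralListener()
    {
        fd_ = ::socket(AF_INET, SOCK_STREAM, 0);
        if (fd_ < 0)
            throw std::runtime_error("socket() failed");

        sockaddr_in addr{};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        addr.sin_port = 0; // port 0 asks the kernel to pick a free port

        socklen_t len = sizeof(addr);
        if (::bind(fd_, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0 ||
            ::listen(fd_, 1) != 0 ||
            ::getsockname(fd_, reinterpret_cast<sockaddr*>(&addr), &len) != 0)
        {
            ::close(fd_);
            throw std::runtime_error("bind/listen/getsockname failed");
        }

        // getsockname reports the port the kernel actually assigned.
        port_ = ntohs(addr.sin_port);
    }

    ~EphemeralListener() { ::close(fd_); }

    EphemeralListener(const EphemeralListener&) = delete;
    EphemeralListener& operator=(const EphemeralListener&) = delete;

    std::uint16_t GetPort() const { return port_; }

    // Endpoint string in host:port form, built from the bound port.
    std::string GetEndpoint() const
    {
        return "127.0.0.1:" + std::to_string(port_);
    }

private:
    int fd_ = -1;
    std::uint16_t port_ = 0;
};
```

Because the port is read back after bind, two listeners created in the same process receive distinct ports, which is what removes the collision on a shared fixed port.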
> A local draft patch already follows this approach:
> * add uint16_t TestServer::GetPort() const in 
> modules/platforms/cpp/thin-client-test/include/test_server.h;
> * in IgniteClientUserThreadPoolSize, use TestServer server(0), build the 
> endpoint from server.GetPort(), and wait up to 5 seconds for the thread 
> count to settle.
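The bounded wait could look roughly like the following. WaitFor is a hypothetical helper, not code from the draft patch; the 10 ms default poll interval is an arbitrary assumption.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical polling helper (not taken from the draft patch).
// Re-evaluates `condition` until it holds or `timeout` elapses.
// The condition is always checked at least once, and once more after
// the final sleep, before the function gives up.
bool WaitFor(const std::function<bool()>& condition,
             std::chrono::milliseconds timeout,
             std::chrono::milliseconds pollInterval = std::chrono::milliseconds(10))
{
    const auto deadline = std::chrono::steady_clock::now() + timeout;

    for (;;)
    {
        if (condition())
            return true;

        if (std::chrono::steady_clock::now() >= deadline)
            return false;

        std::this_thread::sleep_for(pollInterval);
    }
}
```

In the test, the condition would compare the current thread count with the expected value, with a 5-second timeout as proposed above. The test still fails if the count never settles, so the check on the pool size is preserved; only the dependence on scheduler timing is relaxed.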
> h2. Expected result
> The test should continue checking that the configured user thread pool size 
> affects the number of client worker threads, while avoiding fixed-port 
> collisions and scheduler-timing races.
> h2. Validation
> Run:
> {code}
> ctest -R IgniteThinClientTest --output-on-failure
> {code}
> or run the C++ thin-client test executable with:
> {code}
> ignite-thin-client-tests --run_test=IgniteClientTestSuite/IgniteClientUserThreadPoolSize --catch_system_errors=no --log_level=all
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
