linrrzqqq opened a new pull request, #64039: URL: https://github.com/apache/doris/pull/64039
Problem Summary:

Python UDF process pool initialization previously required the whole pool to finish initialization before BE could continue serving the query. In abnormal environments, Python process startup may hang or take a very long time in paths such as:

- `fork` / child process creation
- waiting for the Python Flight socket to appear
- terminating and waiting for a failed child process to exit

When one process slot gets stuck, the whole process pool initialization can be blocked. As a result, FE may hit the send fragments RPC timeout before BE returns a meaningful Python UDF error:

`RpcException, msg: timeout when waiting for send fragments rpc, query timeout:900, left timeout for this operation:30`

be.log:

```text
Initializing Python process pool for version 3.8.19 with 8 processes
Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=20508
Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=40508
Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=60508
Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=80508
Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=100508
Python process pool initialization progress for version 3.8.19: waiting_slot=4/8, success=3, failed=0, elapsed_ms=120508
```

### Solution

Change Python process pool initialization from "wait until all processes are created" to "return once at least one usable process is available".

The pool no longer treats full-size initialization as a prerequisite for serving queries. Once one Python process is alive, the current query can proceed. Missing or failed process slots are repaired asynchronously by the existing health check / repair path.
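
The "return once at least one usable process is available" behavior can be illustrated with a minimal Python sketch. This is not the BE implementation from this PR: `PartialPool`, `start_slot`, `init`, `ready_count`, and `shutdown` are hypothetical names, and a slot is modeled as an arbitrary object returned by a callable that may block or raise. The sketch only shows the waiting logic: each slot starts in its own worker, the caller waits with a deadline, and late workers discard their result after shutdown.

```python
import threading
import time
from typing import Callable, List, Optional


class PartialPool:
    """Hypothetical pool that is usable as soon as one slot is ready."""

    def __init__(self, size: int, start_slot: Callable[[int], object]):
        self._size = size
        self._start_slot = start_slot  # stands in for spawning a process; may hang
        self._slots: List[Optional[object]] = [None] * size
        self._failed = 0
        self._shutdown = False
        self._cond = threading.Condition()

    def _init_one(self, idx: int) -> None:
        try:
            proc = self._start_slot(idx)
        except Exception:
            proc = None
        with self._cond:
            if self._shutdown:
                # Late worker: do not write back after shutdown.
                # A real pool would also terminate the late process here.
                return
            if proc is None:
                self._failed += 1
            else:
                self._slots[idx] = proc
            self._cond.notify_all()

    def init(self, timeout_s: float) -> bool:
        """Return True once any slot is ready, False on timeout or total failure."""
        for i in range(self._size):
            threading.Thread(target=self._init_one, args=(i,), daemon=True).start()
        deadline = time.monotonic() + timeout_s
        with self._cond:
            while True:
                if any(s is not None for s in self._slots):
                    return True
                if self._failed == self._size:
                    return False
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return False
                self._cond.wait(remaining)

    def ready_count(self) -> int:
        with self._cond:
            return sum(s is not None for s in self._slots)

    def shutdown(self) -> None:
        with self._cond:
            self._shutdown = True
            self._cond.notify_all()
```

In this sketch, a caller that gets `False` from `init` would report an error to the client instead of continuing to wait, and slots that are still empty after `init` returns `True` would be left to a separate repair loop, which is not shown.
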
- Bound process pool initialization time, so BE can return `SERVICE_UNAVAILABLE` before the FE send fragments RPC times out.
- Allow partial pool availability: initialization succeeds as long as one usable Python process exists.
- Mark the first initialization round as completed after success or timeout, then rely on health check / repair to fill missing slots.
- Add bounded wait/reap logic for Python child shutdown to avoid blocking indefinitely in `wait`.
- Protect late init / repair workers from writing back after shutdown, and discard late duplicate processes safely.
- Share repair guarding between foreground repair and health check to avoid duplicate repair pressure.
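
As an illustration of the bounded wait/reap idea for child shutdown, here is a minimal Python sketch using the standard `subprocess` module. It is not the code from this PR, which lives in BE: the function name `stop_child` and both timeout parameters are assumptions chosen for the example. The sketch asks the child to terminate, waits for a limited time, escalates to a kill, and waits for a limited time again, so the caller never blocks indefinitely in `wait`.

```python
import subprocess
import sys


def stop_child(proc: subprocess.Popen,
               term_timeout_s: float = 2.0,
               kill_timeout_s: float = 2.0) -> bool:
    """Try to stop and reap `proc` in bounded time. Returns True if reaped."""
    if proc.poll() is not None:
        return True  # already exited and reaped
    proc.terminate()  # polite request first (SIGTERM on POSIX)
    try:
        proc.wait(timeout=term_timeout_s)
        return True
    except subprocess.TimeoutExpired:
        pass
    proc.kill()  # escalate (SIGKILL on POSIX)
    try:
        proc.wait(timeout=kill_timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Give up for now instead of blocking; the caller can retry later.
        return False
```

A caller could use it as `stop_child(subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"]))`. When the function returns `False`, the child has not been reaped yet, so in this sketch the caller is expected to keep the handle and try again on a later pass rather than wait.
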
