pzhdfy opened a new pull request, #55302:
URL: https://github.com/apache/doris/pull/55302

   ### What problem does this PR solve?
   
   Related PR: https://github.com/apache/doris/pull/30082
   
   Problem Summary:
   
   We run Doris 3.0.4 with Java UDFs. After running for a long time, some BEs hit JVM deadlocks with UNKNOWN_owner_addr.
   [deadlock in UDF close]
   <img width="2314" height="882" alt="image" src="https://github.com/user-attachments/assets/c16b5d18-9900-45dc-8bb4-894180694a44" />
   [deadlock in throwableToStackTrace, waiting on a StringWriter; this should be impossible, because a new StringWriter is created on every call]
   <img width="1724" height="720" alt="image" src="https://github.com/user-attachments/assets/b586f752-0ba8-43cc-89b4-352a7d9ab00c" />
   <img width="618" height="144" alt="image" src="https://github.com/user-attachments/assets/20c2ad11-beef-4b0c-b2a2-343d9b0fd14c" />
   
   
   In be.WARNING we found "get JNIEnv failed" errors when closing the UDF:
   <img width="1659" height="426" alt="image" src="https://github.com/user-attachments/assets/6cf54aa2-522a-407e-a53a-e17fdbc9162e" />
   In be.out we found libhdfs error logs (on x86, the JNIEnv is obtained through libhdfs):
   Call to AttachCurrentThread failed with error: -1
   getJNIEnv: getGlobalJNIEnv failed
   
   be.WARNING also contains other errors raised while calling GetJniExceptionMsg:
   <img width="1686" height="530" alt="image" src="https://github.com/user-attachments/assets/35ca7b4c-0a9d-4da3-8be2-9a07ccebb74b" />
   
   All of these errors have two things in common:
   1. Every case occurred in JavaFunctionCall::close.
   2. Every stack trace contains the keyword bthread.
   
   After searching the web, we found that JNI is not compatible with bthread:
   https://blog.csdn.net/qq_46104835/article/details/139360911
   https://gitee.com/baidu/BRPC/blob/master/docs/cn/server.md#pthread%E6%A8%A1%E5%BC%8F
   <img width="1101" height="211" alt="image" src="https://github.com/user-attachments/assets/e1da2f9a-9e37-47d9-b0a1-c1652da93e75" />
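
   One likely reason for the incompatibility is that a JNIEnv pointer is only valid on the OS thread that obtained it, while a bthread may be resumed on a different worker pthread after it yields. The sketch below imitates that with plain std::thread; `fake_env` and `cached_env_matches_after_migration` are hypothetical names used only for illustration, and no real JNI or bRPC call is involved.

```cpp
#include <cassert>
#include <future>
#include <thread>

// Stand-in for a per-thread JNIEnv: every OS thread gets its own instance.
thread_local int fake_env = 0;

// Imitates a task that caches the address of its thread-local "env" on one
// OS thread and is later resumed on another one, as a user-level scheduler
// may do. Returns whether the cached address still matches the current env.
bool cached_env_matches_after_migration() {
    std::promise<int*> cached;
    std::promise<void> finish;
    std::future<void> finish_f = finish.get_future();

    std::thread first([&] {
        cached.set_value(&fake_env);  // "attach" and cache on the first thread
        finish_f.wait();              // stay alive so the storage stays valid
    });

    int* env_from_first = cached.get_future().get();
    assert(env_from_first != nullptr);

    bool same = false;
    std::thread second([&] {
        // The "resumed" task compares the cached pointer with this thread's env.
        same = (env_from_first == &fake_env);
    });
    second.join();

    finish.set_value();
    first.join();
    return same;
}
```

   Calling `cached_env_matches_after_migration()` returns false: the second thread has its own thread-local instance, so state cached on the first thread no longer describes the thread the task is running on.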
   
   After we switched from bthread to pthread mode, everything worked fine.
   
   To find out how often JavaFunctionCall::close runs in a bthread, we added metrics. Only about 1 in 10,000 JavaFunctionCall::close calls runs in a bthread:
   <img width="1091" height="205" alt="image" src="https://github.com/user-attachments/assets/2f7d996c-c2bb-425d-9a07-92f0920fe018" />
   
   But why does JavaFunctionCall::close run in a bthread at all? Since https://github.com/apache/doris/issues/16634, exec_plan_fragment runs in a pthread instead of a bthread.
   
   Then we found this PR:
   https://github.com/apache/doris/pull/30082
   
   ExchangeSinkBuffer<Parent>::_send_rpc sets a send_callback that holds a weak_task_ctx:
   <img width="1392" height="254" alt="image" src="https://github.com/user-attachments/assets/a1d28f18-c3c7-4953-baab-d1109f341aa1" />
   send_callback runs in a bthread and may call weak_task_ctx.lock() to get a shared_ptr to task_ctx. If that shared_ptr turns out to be the last reference, the task_ctx destructor is called inside send_callback, that is, in a bthread.
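
   The ownership effect described above can be reproduced without any Doris code. This is a minimal sketch using std::thread in place of a bthread; `TaskCtx` and `destructor_runs_on_callback_thread` are hypothetical names, not the real task context or any Doris API.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <thread>

// Hypothetical stand-in for a task context whose destructor would end up
// closing a Java UDF. It records the thread its destructor ran on.
struct TaskCtx {
    std::thread::id* destroyed_on;
    explicit TaskCtx(std::thread::id* out) : destroyed_on(out) {}
    ~TaskCtx() { *destroyed_on = std::this_thread::get_id(); }
};

// Returns true when the destructor ran on the callback thread rather than
// on the thread that created the object.
bool destructor_runs_on_callback_thread() {
    std::thread::id destroyed_on;
    auto owner = std::make_shared<TaskCtx>(&destroyed_on);
    std::weak_ptr<TaskCtx> weak = owner;

    std::promise<void> locked;
    std::promise<void> released;
    std::future<void> released_f = released.get_future();

    std::thread callback([&] {
        std::shared_ptr<TaskCtx> ctx = weak.lock();  // callback becomes a co-owner
        assert(ctx != nullptr);
        locked.set_value();
        released_f.wait();  // the original owner drops its reference meanwhile
        // ctx goes out of scope here as the last reference, so ~TaskCtx()
        // runs on this thread.
    });

    locked.get_future().wait();
    owner.reset();  // no longer the last reference, so nothing is destroyed yet
    released.set_value();

    const std::thread::id callback_id = callback.get_id();
    callback.join();
    return destroyed_on == callback_id &&
           destroyed_on != std::this_thread::get_id();
}
```

   In this sketch the function returns true: whichever thread releases the last shared_ptr runs the destructor, regardless of which thread created the object.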
   
   So we modified JavaFunctionCall::close: when it runs in a bthread, we submit the JNI operation to a pthread pool and wait for it to finish.
   Because only about 1 in 10,000 JavaFunctionCall::close calls runs in a bthread, the pthread pool can be small.
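
   The dispatch-and-wait idea can be sketched as follows. This is not the code in this PR: `SmallThreadPool` and `close_udf` are hypothetical names, the `in_bthread` flag stands in for whatever check the BE really uses, and `jni_work` stands in for the JNI calls made during close.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal fixed-size pool of ordinary OS threads (pthreads on Linux).
class SmallThreadPool {
public:
    explicit SmallThreadPool(std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            workers_.emplace_back([this] { run(); });
        }
    }
    ~SmallThreadPool() {
        {
            std::lock_guard<std::mutex> lock(mu_);
            stop_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    // Queues a task and returns a future the caller can block on.
    std::future<void> submit(std::function<void()> fn) {
        auto task = std::make_shared<std::packaged_task<void()>>(std::move(fn));
        std::future<void> done = task->get_future();
        {
            std::lock_guard<std::mutex> lock(mu_);
            tasks_.push([task] { (*task)(); });
        }
        cv_.notify_one();
        return done;
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mu_);
                cv_.wait(lock, [this] { return stop_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // stop requested and queue drained
                job = std::move(tasks_.front());
                tasks_.pop();
            }
            job();
        }
    }
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> tasks_;
    std::vector<std::thread> workers_;
    bool stop_ = false;
};

// Runs jni_work inline on a normal thread, or on the pool when the caller
// reports that it is on a bthread. Returns the id of the thread that ran it.
std::thread::id close_udf(bool in_bthread, SmallThreadPool& pool,
                          const std::function<void()>& jni_work) {
    assert(jni_work);
    std::thread::id ran_on;
    auto wrapped = [&] {
        ran_on = std::this_thread::get_id();
        jni_work();
    };
    if (in_bthread) {
        pool.submit(wrapped).get();  // block until the pool thread finishes
    } else {
        wrapped();
    }
    return ran_on;
}
```

   With a pool of two threads, `close_udf(false, pool, work)` runs `work` on the calling thread, while `close_udf(true, pool, work)` runs it on a pool thread and returns only after it has finished. Blocking the caller keeps close synchronous, which seems acceptable here because the bthread path is rare.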
   
   
   
   ### Release note
   
   None
   
   ### Check List (For Author)
   
   - Test <!-- At least one of them must be included. -->
       - [ ] Regression test
       - [ ] Unit Test
       - [ ] Manual test (add detailed scripts or steps below)
       - [ ] No need to test or manual test. Explain why:
           - [ ] This is a refactor/code format and no logic has been changed.
           - [ ] Previous test can cover this change.
           - [ ] No code files have been changed.
           - [ ] Other reason <!-- Add your reason?  -->
   
   - Behavior changed:
       - [ ] No.
       - [ ] Yes. <!-- Explain the behavior change -->
   
   - Does this need documentation?
       - [ ] No.
       - [ ] Yes. <!-- Add document PR link here. eg: 
https://github.com/apache/doris-website/pull/1214 -->
   
   ### Check List (For Reviewer who merge this PR)
   
   - [ ] Confirm the release note
   - [ ] Confirm test cases
   - [ ] Confirm document
   - [ ] Add branch pick label <!-- Add branch pick label that this PR should 
merge into -->
   
   



