BohanZhang0222 opened a new issue, #6771:
URL: https://github.com/apache/kyuubi/issues/6771

   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [X] I have searched in the 
[issues](https://github.com/apache/kyuubi/issues?q=is%3Aissue) and found no 
similar issues.
   
   
   ### Describe the bug
   
   I use the Kyuubi batch API v2 to submit Spark jobs to YARN in cluster mode.
   When the Kyuubi node that receives the API call is not the node that submits 
the Spark job, the submission fails with an error saying the resource file 
cannot be found. My analysis is that when I call the API, Kyuubi places the 
uploaded resource file in a local directory, and this directory is not shared 
among the Kyuubi nodes. As a result, when the batch is scheduled for submission 
on another node, that node cannot find the resource file.
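
To make the suspected cause concrete, here is a minimal simulation. It is not Kyuubi code: the `Node` class, the `uploads/app.jar` path, and the two temporary directories standing in for per-node local disks are all assumptions for illustration. It only shows that a path recorded by the node that accepted the upload does not resolve on a node with a different local disk.

```python
import tempfile
from pathlib import Path


class Node:
    """Hypothetical stand-in for one server instance with a private local disk."""

    def __init__(self, root: Path):
        # 'root' simulates this node's own local filesystem.
        self.root = root

    def save_upload(self, rel_path: str, data: bytes) -> str:
        # Store the uploaded resource on this node's local disk only.
        target = self.root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        # The same path string is what would be recorded in shared metadata.
        return rel_path

    def has_resource(self, rel_path: str) -> bool:
        # A node can only resolve the recorded path against its own disk.
        return (self.root / rel_path).is_file()


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        node_a = Node(Path(tmp) / "node-a")  # receives the API call
        node_b = Node(Path(tmp) / "node-b")  # picks up the batch for submission
        recorded = node_a.save_upload("uploads/app.jar", b"jar bytes")
        return node_a.has_resource(recorded), node_b.has_resource(recorded)


if __name__ == "__main__":
    print(demo())
```

In this sketch only the node that accepted the upload can see the file, which matches the "resource file cannot be found" symptom described above.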
   
   ### Affects Version(s)
   
   1.9.1
   
   ### Kyuubi Server Log Output
   
   _No response_
   
   ### Kyuubi Engine Log Output
   
   _No response_
   
   ### Kyuubi Server Configurations
   
   ```yaml
   kyuubi.batch.impl.version=2
   kyuubi.batch.submitter.enabled=true
   ```
   
   
   ### Kyuubi Engine Configurations
   
   _No response_
   
   ### Additional context
   
   The workaround I tried:
   Kyuubi has an environment variable, `kyuubi_work_dir`. I changed this 
directory to point to shared storage.
   But this failed.
   The problem I ran into is that job submission fails intermittently. The 
Spark submission log is visible in the Kyuubi server, and the corresponding 
batch id is present in the database, but the submission does not succeed and no 
YARN application id is obtained. The batch state changes from PENDING to ERROR 
very quickly.
   
   Calling the batch locallog interface returned no useful error content. (This 
happened in the production environment, which has since been rolled back, so I 
cannot provide screenshots.) However, the locallog output does mention the path 
of the detailed error log, which is a log file in the username subdirectory of 
the Kyuubi work directory (the shared directory configured through the 
environment variable).
   
   When I opened this log file, I found that its content described a different 
job. At that point I realized that sharing the work directory among multiple 
nodes may have caused jobs to conflict with each other.
   
   I realized that the uploaded resource files might also have conflicts, so I 
executed the following query.
   
![image](https://github.com/user-attachments/assets/1a2f7f80-ad2c-4e5e-82d7-a585d835c230)
   
   This confirms that a shared work directory causes resource file and log 
conflicts between nodes.
   However, I cannot confirm whether this is the cause of the intermittent 
submission failures.
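
The conflict can be illustrated with a small simulation. This is a sketch under assumptions, not Kyuubi's actual naming scheme: it assumes, hypothetically, that each node names log files from its own per-node counter inside a per-user subdirectory, and the helper names and the `engine.log` file name are invented for illustration. It contrasts that scheme with one that puts each job under its unique batch id.

```python
import tempfile
import uuid
from pathlib import Path


def write_log_by_counter(shared_root: Path, user: str, counter: int, text: str) -> Path:
    # Hypothetical scheme: the file name depends only on the user and a per-node counter.
    path = shared_root / user / f"engine.log.{counter}"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return path


def write_log_by_batch_id(shared_root: Path, user: str, batch_id: str, text: str) -> Path:
    # Alternative scheme: a directory per batch id, which is unique across nodes.
    path = shared_root / user / batch_id / "engine.log"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text)
    return path


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        shared = Path(tmp)  # stands in for the shared work directory

        # Two nodes each start their own counter at 0 and write for the same user.
        a = write_log_by_counter(shared, "alice", 0, "job from node A")
        b = write_log_by_counter(shared, "alice", 0, "job from node B")
        # Same path: node A's log now holds node B's job.
        collided = a == b and a.read_text() == "job from node B"

        a2 = write_log_by_batch_id(shared, "alice", str(uuid.uuid4()), "job from node A")
        b2 = write_log_by_batch_id(shared, "alice", str(uuid.uuid4()), "job from node B")
        isolated = a2 != b2 and a2.read_text() == "job from node A"
        return collided, isolated


if __name__ == "__main__":
    print(demo())
```

Under the counter-based assumption, a log file ends up describing another node's job, which is the symptom observed above. Keying paths by batch id is one possible way to keep a shared directory safe.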
   
   
   ### Are you willing to submit PR?
   
   - [ ] Yes. I would be willing to submit a PR with guidance from the Kyuubi 
community to fix.
   - [X] No. I cannot submit a PR at this time.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: notifications-unsubscr...@kyuubi.apache.org
For additional commands, e-mail: notifications-h...@kyuubi.apache.org
