ic4y opened a new issue, #6481: URL: https://github.com/apache/kyuubi/issues/6481
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/kyuubi/issues?q=is%3Aissue) and found no similar issues.

### What would you like to be improved?

The following tests were all conducted with the engine started but before executing any SQL.

1. The driver memory usage when submitting to the Spark engine via Kyuubi is as follows:

2. The driver memory usage when submitting through Spark SQL directly is as follows:

Comparing the two, the JVM memory usage of the Kyuubi Spark engine driver differs significantly in the Class, Thread, GC, and Internal areas. The off-heap memory usage of the driver submitted via Kyuubi is around 1GB.

1. Although Kyuubi naturally loads some additional classes and threads compared to Spark SQL, this is still a noteworthy point (it could be examined for possible optimization).

2. This can cause the driver container to be killed abnormally when spark.driver.memoryOverhead is left at its default value (YARN decides that the actual memory usage exceeds the limit).

An example to illustrate: the default value of spark.driver.memoryOverhead is the greater of spark.driver.memory * 0.1 and 384MB. Suppose spark.driver.memory is set to 2048M; since 10% of 2048M is only about 205M, the default memoryOverhead is 384M, so YARN will kill the driver container once the actual memory usage exceeds 2432M. With 1024M of off-heap memory already in use, only about 1408M is left for the heap, even though the maximum heap size is 2048M. Under normal use the heap can therefore grow to roughly 1408M (much of which could be collected down to around 200M) without a full GC ever being triggered, and the container is killed first.

This leads to a situation where setting spark.driver.memory = 6G does not let the program run normally (it gets killed), while setting spark.driver.memory = 2G together with spark.driver.memoryOverhead = 1G runs successfully. The mismatch between memoryOverhead and the actual off-heap memory usage means the container is killed before heap usage ever reaches the spark.driver.memory maximum, while a full GC typically only happens once heap usage approaches that maximum.
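To make the kill-threshold arithmetic concrete, here is a minimal standalone sketch; the 0.10 factor and the 384MiB floor are Spark's documented defaults for spark.driver.memoryOverhead, hard-coded here purely for illustration:

```scala
object DriverOverheadMath {
  // Spark's documented default: max(driverMemoryMb * 0.10, 384) MiB.
  def defaultOverheadMb(driverMemoryMb: Long): Long =
    math.max((driverMemoryMb * 0.10).toLong, 384L)

  def main(args: Array[String]): Unit = {
    val driverMemoryMb = 2048L  // spark.driver.memory
    val offHeapUsedMb  = 1024L  // observed off-heap usage of the Kyuubi engine driver

    val overheadMb  = defaultOverheadMb(driverMemoryMb)  // 384M
    val containerMb = driverMemoryMb + overheadMb        // 2432M: YARN kill threshold
    val heapRoomMb  = containerMb - offHeapUsedMb        // 1408M: heap room before the kill

    println(s"container limit = ${containerMb}M, heap room before kill = ${heapRoomMb}M, " +
      s"max heap = ${driverMemoryMb}M")
  }
}
```

Note that even with a 6G heap the default overhead is only 614M, still well below the ~1G of off-heap memory actually used, which is why the larger heap setting fails while 2G plus an explicit 1G overhead succeeds.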
Therefore, once this issue is confirmed, it is strongly recommended to set spark.driver.memoryOverhead to 1G, either prominently in the documentation or as a Kyuubi default value (a sample configuration snippet is included at the end of this issue). Whether a similar issue exists for executors has not yet been tested. This adjustment can solve many SQL execution stability problems: after adding this parameter to our cluster, the number of failed SQL queries dropped noticeably.

### How should we improve?

_No response_

### Are you willing to submit PR?

- [ ] Yes. I would be willing to submit a PR with guidance from the Kyuubi community to improve.
- [ ] No. I cannot submit a PR at this time.
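As referenced above, a sketch of the recommended change, assuming the standard Kyuubi deployment layout (Kyuubi forwards `spark.*` entries from this file to the engine's spark-submit):

```properties
# conf/kyuubi-defaults.conf
# Reserve enough off-heap headroom for the Kyuubi engine driver so that
# YARN does not kill the container before the heap can be fully used.
spark.driver.memoryOverhead=1g
```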
