maobaolong opened a new issue, #2269: URL: https://github.com/apache/incubator-uniffle/issues/2269
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues.

### Describe the bug

The client keeps retrying reportShuffleResult even after the calling thread has been interrupted. As a result, the server receives duplicate reportShuffleResult RPCs, which wastes server CPU. The client should stop retrying and throw an RssException instead.

### Affects Version(s)

master

### Uniffle Server Log Output

- Server.log

```logtalk
[2024-11-29 02:43:02.906] [Grpc-311] [WARN] ShuffleServerGrpcService - Existing 300 duplicated blockIds on blockId report for appId: application_1729849261128_3823937_1732816526781, shuffleId: 6
```

- Server Rpc audit log

```
[2024-11-29 02:43:02.893] appId application_1729849261128_3823937_1732816526781 cmd=reportShuffleResult argstaskAttemptId=16897, bitmapNum=10, partitionToBlockIdsSize=300
```

<img width="1601" alt="image" src="https://github.com/user-attachments/assets/2e2a774e-64bf-4178-9e12-93ee433384e7">

### Uniffle Engine Log Output

```logtalk
spark.log
[02:42:51:164] [Executor task launch worker for task 2112.1 in stage 9.0 (TID 54359)] INFO org.apache.spark.shuffle.writer.RssShuffleWriter.<init>:193 - RssShuffle start write taskAttemptId[16897] data with RssHandle[appId application_1729849261128_3823937_1732816526781, shuffleId 6].
[02:42:56:731] [Executor task launch worker for task 2112.1 in stage 9.0 (TID 54359)] INFO org.apache.spark.shuffle.writer.WriteBufferManager.clear:365 - Flush total buffer for shuffleId[6] with allocated[134217728], dataSize[113121456], memoryUsed[124038144], number of blocks[6000], flush ratio[1.0]
[02:42:59:230] [dispatcher-Executor] INFO org.apache.spark.executor.Executor.logInfo:61 - Executor is trying to kill task 2112.1 in stage 9.0 (TID 54359), reason: another attempt succeeded
[02:43:03:205] [Executor task launch worker for task 2112.1 in stage 9.0 (TID 54359)] WARN org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient.doReportShuffleResult:797 - Report shuffle result to host[192.168.1.100], port[19977] failed, try again, retryNum[100]
org.apache.uniffle.shaded.io.grpc.StatusRuntimeException: CANCELLED: Thread interrupted
    at org.apache.uniffle.shaded.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268) ~[rss-client-spark3-shaded.jar:?]
    at org.apache.uniffle.shaded.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249) ~[rss-client-spark3-shaded.jar:?]
    at org.apache.uniffle.shaded.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167) ~[rss-client-spark3-shaded.jar:?]
    at org.apache.uniffle.proto.ShuffleServerGrpc$ShuffleServerBlockingStub.reportShuffleResult(ShuffleServerGrpc.java:814) ~[rss-client-spark3-shaded.jar:?]
    at org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient.doReportShuffleResult(ShuffleServerGrpcClient.java:793) ~[rss-client-spark3-shaded.jar:?]
    at org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient.reportShuffleResult(ShuffleServerGrpcClient.java:762) ~[rss-client-spark3-shaded.jar:?]
    at org.apache.uniffle.client.impl.ShuffleWriteClientImpl.reportShuffleResult(ShuffleWriteClientImpl.java:724) ~[rss-client-spark3-shaded.jar:?]
    at org.apache.spark.shuffle.writer.RssShuffleWriter.stop(RssShuffleWriter.java:778) ~[rss-client-spark3-shaded.jar:?]
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:60) ~[spark-core_2.12-3.3.1.jar:3.3.1]
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) ~[spark-core_2.12-3.3.1.jar:3.3.1]
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) ~[spark-core_2.12-3.3.1.jar:3.3.1]
    at org.apache.spark.scheduler.Task.run(Task.scala:136) ~[spark-core_2.12-3.3.1.jar:3.3.1]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) ~[spark-core_2.12-3.3.1.jar:3.3.1]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1556) ~[spark-core_2.12-3.3.1.jar:3.3.1]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) ~[spark-core_2.12-3.3.1.jar:3.3.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_392]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_392]
    at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392]
Caused by: java.lang.InterruptedException
    at org.apache.uniffle.shaded.io.grpc.stub.ClientCalls$ThreadlessExecutor.throwIfInterrupted(ClientCalls.java:750) ~[rss-client-spark3-shaded.jar:?]
    at org.apache.uniffle.shaded.io.grpc.stub.ClientCalls$ThreadlessExecutor.waitAndDrain(ClientCalls.java:711) ~[rss-client-spark3-shaded.jar:?]
    at org.apache.uniffle.shaded.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:159) ~[rss-client-spark3-shaded.jar:?]
    ... 15 more
[02:43:03:206] [Executor task launch worker for task 2112.1 in stage 9.0 (TID 54359)] WARN org.apache.uniffle.client.impl.ShuffleWriteClientImpl.reportShuffleResult:747 - Report shuffle result is failed to ShuffleServerInfo{host[192.168.1.100], grpc port[19977], netty port[17000]} for appId[application_1729849261128_3823937_1732816526781], shuffleId[6]
[02:43:03:208] [Executor task launch worker for task 2112.1 in stage 9.0 (TID 54359)] WARN org.apache.spark.memory.ExecutionMemoryPool.logWarning:73 - Internal error: release called on 10179584 bytes but task only has 0 bytes of memory from the on-heap execution pool
[02:43:03:208] [Executor task launch worker for task 2112.1 in stage 9.0 (TID 54359)] INFO org.apache.spark.executor.Executor.logInfo:61 - Executor interrupted and killed task 2112.1 in stage 9.0 (TID 54359), reason: another attempt succeeded
```

### Uniffle Server Configurations

_No response_

### Uniffle Engine Configurations

_No response_

### Additional context

_No response_

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
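
The behavior requested under "Describe the bug" (stop retrying once the caller is interrupted and throw an exception) could look roughly like the sketch below. It is not the Uniffle client code: `InterruptAwareRetry`, `callWithRetry`, and the nested `RssException` are hypothetical stand-ins invented for illustration, and the sketch assumes that a cancelled blocking call either throws `InterruptedException` or leaves the thread's interrupt flag set before throwing.

```java
import java.util.concurrent.Callable;

/** Sketch only: none of these names are Uniffle's actual API. */
class InterruptAwareRetry {

  /** Hypothetical stand-in for Uniffle's RssException. */
  static class RssException extends RuntimeException {
    RssException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /**
   * Calls rpc up to maxAttempts times, but gives up as soon as the calling
   * thread is interrupted instead of sending the request again.
   */
  static <T> T callWithRetry(Callable<T> rpc, int maxAttempts, long backoffMs) {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return rpc.call();
      } catch (InterruptedException e) {
        // Restore the flag so the check below and the caller can both see it.
        Thread.currentThread().interrupt();
        last = e;
      } catch (Exception e) {
        last = e;
      }
      // Assumption: an interrupted blocking call leaves the interrupt flag set.
      if (Thread.currentThread().isInterrupted()) {
        throw new RssException("Interrupted after attempt " + attempt + ", not retrying", last);
      }
      try {
        Thread.sleep(backoffMs);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        throw new RssException("Interrupted while backing off, not retrying", e);
      }
    }
    throw new RssException("Failed after " + maxAttempts + " attempts", last);
  }

  public static void main(String[] args) {
    int[] calls = {0};
    try {
      callWithRetry(() -> {
        calls[0]++;
        // Simulate a blocking call cancelled because the thread was interrupted.
        Thread.currentThread().interrupt();
        throw new IllegalStateException("CANCELLED: Thread interrupted");
      }, 100, 10L);
    } catch (RssException e) {
      System.out.println("gave up after " + calls[0] + " call(s): " + e.getMessage());
    }
  }
}
```

With a check like this, a killed task would send the report at most once more after the interrupt rather than cycling through the remaining retries. The interrupt flag is left set on exit so the surrounding task runner can still observe the kill.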
