maobaolong opened a new issue, #2269:
URL: https://github.com/apache/incubator-uniffle/issues/2269

   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [X] I have searched in the 
[issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### Describe the bug
   
   The client continues to retry reportShuffleResult even after being interrupted by the caller. This causes the server to receive duplicated reportShuffleResult RPCs, which wastes server CPU.
   
   It would be better to stop retrying and throw an RssException instead.
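   As a rough sketch of the intended behavior (this is not the actual Uniffle code; `RssException` is stubbed here and the real retry loop in `ShuffleServerGrpcClient.reportShuffleResult` differs), the retry loop could check the thread's interrupt status and the exception's cause, and give up immediately instead of retrying up to the retry limit:

   ```java
   // Hypothetical illustration: an interrupt-aware retry loop that throws
   // instead of retrying when the calling thread has been interrupted
   // (e.g. the Spark executor killed the task because another attempt succeeded).
   public class InterruptAwareRetry {

       // Stand-in for org.apache.uniffle.common.exception.RssException.
       static class RssException extends RuntimeException {
           RssException(String message, Throwable cause) { super(message, cause); }
       }

       @FunctionalInterface
       interface Rpc { void call() throws Exception; }

       static void reportWithRetry(Rpc rpc, int maxRetries) {
           for (int retryNum = 0; retryNum < maxRetries; retryNum++) {
               try {
                   rpc.call();
                   return;  // success, no need to retry
               } catch (Exception e) {
                   // If the failure was caused by an interrupt, retrying will
                   // only send more duplicated RPCs to the server. Stop here.
                   if (Thread.currentThread().isInterrupted()
                           || e instanceof InterruptedException
                           || e.getCause() instanceof InterruptedException) {
                       throw new RssException("Report interrupted, stop retrying", e);
                   }
                   // Transient failure: fall through and retry.
               }
           }
           throw new RssException("Report failed after " + maxRetries + " retries", null);
       }
   }
   ```

   With a check like this, the CANCELLED gRPC failure in the log above would surface once as an RssException rather than being retried 100 times.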
   
   ### Affects Version(s)
   
   master
   
   ### Uniffle Server Log Output
   
   ```
   - Server.log
   
   [2024-11-29 02:43:02.906] [Grpc-311] [WARN] ShuffleServerGrpcService - 
Existing 300 duplicated blockIds on blockId report for appId: 
application_1729849261128_3823937_1732816526781, shuffleId: 6
   ```
   
   - Server Rpc audit log
   ```
   [2024-11-29 02:43:02.893] appId 
application_1729849261128_3823937_1732816526781 cmd=reportShuffleResult 
argstaskAttemptId=16897, bitmapNum=10, partitionToBlockIdsSize=300
   ```
   
   
   <img width="1601" alt="image" src="https://github.com/user-attachments/assets/2e2a774e-64bf-4178-9e12-93ee433384e7">
   
   
   ### Uniffle Engine Log Output
   
   ```
   spark.log
   
   
   [02:42:51:164] [Executor task launch worker for task 2112.1 in stage 9.0 
(TID 54359)] INFO  org.apache.spark.shuffle.writer.RssShuffleWriter.<init>:193 
- RssShuffle start write taskAttemptId[16897] data with RssHandle[appId 
application_1729849261128_3823937_1732816526781, shuffleId 6].
   
   [02:42:56:731] [Executor task launch worker for task 2112.1 in stage 9.0 
(TID 54359)] INFO  org.apache.spark.shuffle.writer.WriteBufferManager.clear:365 
- Flush total buffer for shuffleId[6] with allocated[134217728], 
dataSize[113121456], memoryUsed[124038144], number of blocks[6000], flush 
ratio[1.0]
   [02:42:59:230] [dispatcher-Executor] INFO  
org.apache.spark.executor.Executor.logInfo:61 - Executor is trying to kill task 
2112.1 in stage 9.0 (TID 54359), reason: another attempt succeeded
   
   [02:43:03:205] [Executor task launch worker for task 2112.1 in stage 9.0 
(TID 54359)] WARN  
org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient.doReportShuffleResult:797
 - Report shuffle result to host[192.168.1.100], port[19977] failed, try again, 
retryNum[100]
   org.apache.uniffle.shaded.io.grpc.StatusRuntimeException: CANCELLED: Thread 
interrupted
        at 
org.apache.uniffle.shaded.io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:268)
 ~[rss-client-spark3-shaded.jar:?]
        at 
org.apache.uniffle.shaded.io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:249)
 ~[rss-client-spark3-shaded.jar:?]
        at 
org.apache.uniffle.shaded.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:167)
 ~[rss-client-spark3-shaded.jar:?]
        at 
org.apache.uniffle.proto.ShuffleServerGrpc$ShuffleServerBlockingStub.reportShuffleResult(ShuffleServerGrpc.java:814)
 ~[rss-client-spark3-shaded.jar:?]
        at 
org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient.doReportShuffleResult(ShuffleServerGrpcClient.java:793)
 ~[rss-client-spark3-shaded.jar:?]
        at 
org.apache.uniffle.client.impl.grpc.ShuffleServerGrpcClient.reportShuffleResult(ShuffleServerGrpcClient.java:762)
 ~[rss-client-spark3-shaded.jar:?]
        at 
org.apache.uniffle.client.impl.ShuffleWriteClientImpl.reportShuffleResult(ShuffleWriteClientImpl.java:724)
 ~[rss-client-spark3-shaded.jar:?]
        at 
org.apache.spark.shuffle.writer.RssShuffleWriter.stop(RssShuffleWriter.java:778)
 ~[rss-client-spark3-shaded.jar:?]
        at 
org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:60)
 ~[spark-core_2.12-3.3.1.jar:3.3.1]
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) 
~[spark-core_2.12-3.3.1.jar:3.3.1]
        at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) 
~[spark-core_2.12-3.3.1.jar:3.3.1]
        at org.apache.spark.scheduler.Task.run(Task.scala:136) 
~[spark-core_2.12-3.3.1.jar:3.3.1]
        at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
 ~[spark-core_2.12-3.3.1.jar:3.3.1]
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1556) 
~[spark-core_2.12-3.3.1.jar:3.3.1]
        at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) 
~[spark-core_2.12-3.3.1.jar:3.3.1]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
~[?:1.8.0_392]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
~[?:1.8.0_392]
        at java.lang.Thread.run(Thread.java:750) ~[?:1.8.0_392]
   Caused by: java.lang.InterruptedException
        at 
org.apache.uniffle.shaded.io.grpc.stub.ClientCalls$ThreadlessExecutor.throwIfInterrupted(ClientCalls.java:750)
 ~[rss-client-spark3-shaded.jar:?]
        at 
org.apache.uniffle.shaded.io.grpc.stub.ClientCalls$ThreadlessExecutor.waitAndDrain(ClientCalls.java:711)
 ~[rss-client-spark3-shaded.jar:?]
        at 
org.apache.uniffle.shaded.io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:159)
 ~[rss-client-spark3-shaded.jar:?]
        ... 15 more
   [02:43:03:206] [Executor task launch worker for task 2112.1 in stage 9.0 
(TID 54359)] WARN  
org.apache.uniffle.client.impl.ShuffleWriteClientImpl.reportShuffleResult:747 - 
Report shuffle result is failed to ShuffleServerInfo{host[192.168.1.100], grpc 
port[19977], netty port[17000]} for 
appId[application_1729849261128_3823937_1732816526781], shuffleId[6]
   [02:43:03:208] [Executor task launch worker for task 2112.1 in stage 9.0 
(TID 54359)] WARN  org.apache.spark.memory.ExecutionMemoryPool.logWarning:73 - 
Internal error: release called on 10179584 bytes but task only has 0 bytes of 
memory from the on-heap execution pool
   [02:43:03:208] [Executor task launch worker for task 2112.1 in stage 9.0 
(TID 54359)] INFO  org.apache.spark.executor.Executor.logInfo:61 - Executor 
interrupted and killed task 2112.1 in stage 9.0 (TID 54359), reason: another 
attempt succeeded
   ```
   
   
   ### Uniffle Server Configurations
   
   _No response_
   
   ### Uniffle Engine Configurations
   
   _No response_
   
   ### Additional context
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

