GoodJeek opened a new issue, #2020:
URL: https://github.com/apache/incubator-uniffle/issues/2020

   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [X] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues.
   
   
   ### Describe the bug
   
   ```
   [2024-08-07 16:43:19.813] [Grpc-166] [INFO] 
CoordinatorGrpcService.getShuffleAssignments - Request of getShuffleAssignments 
for appId[spark-82104866780a4adab8b4bedc7616459e_1723016966711], shuffleId[8], 
partitionNum[1],  partitionNumPerRange[1], replica[1], requiredTags[[ss_v4, 
GRPC]], requiredShuffleServerNumber[-1],faultyServerIds[0]
   [2024-08-07 16:43:19.813] [Grpc-166] [ERROR] 
CoordinatorGrpcService.getShuffleAssignments - Errors on getting shuffle 
assignments for app: spark-82104866780a4adab8b4bedc7616459e_1723016966711, 
shuffleId: 8, partitionNum: 1, partitionNumPerRange: 1, replica: 1, 
requiredTags: [ss_v4, GRPC]
   org.apache.uniffle.common.exception.RssException: There isn't enough shuffle 
servers
        at 
org.apache.uniffle.coordinator.strategy.assignment.PartitionBalanceAssignmentStrategy.assign(PartitionBalanceAssignmentStrategy.java:138)
        at 
org.apache.uniffle.coordinator.CoordinatorGrpcService.getShuffleAssignments(CoordinatorGrpcService.java:139)
        at 
org.apache.uniffle.proto.CoordinatorServerGrpc$MethodHandlers.invoke(CoordinatorServerGrpc.java:1032)
        at 
io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at 
io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at 
io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at 
io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:351)
        at 
io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at 
io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   [2024-08-07 16:43:41.259] [ApplicationManager-0] [INFO] 
ApplicationManager.statusCheck - Start to check status for 0 applications.
   [2024-08-07 16:44:11.259] [ApplicationManager-0] [INFO] 
ApplicationManager.statusCheck - Start to check status for 0 applications.
   [2024-08-07 16:44:41.259] [ApplicationManager-0] [INFO] 
ApplicationManager.statusCheck - Start to check status for 0 applications.
   [2024-08-07 16:45:11.259] [ApplicationManager-0] [INFO] 
ApplicationManager.statusCheck - Start to check status for 0 applications.
   [2024-08-07 16:45:21.271] [SimpleClusterManager-0] [INFO] 
SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: 
[10.42.1.0-19999, 10.42.3.0-19999, 10.42.0.0-19999]
   ```
   
   ### Affects Version(s)
   
   rss-0.9.0
   
   ### Uniffle Server Log Output
   
   ```logtalk
   [2024-08-07 17:44:49.453] [SimpleClusterManager-0] [INFO] 
SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: 
[10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
   [2024-08-07 17:44:59.453] [SimpleClusterManager-0] [INFO] 
SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: 
[10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
   [2024-08-07 17:45:09.440] [ApplicationManager-0] [INFO] 
ApplicationManager.statusCheck - Start to check status for 0 applications.
   [2024-08-07 17:45:09.452] [SimpleClusterManager-0] [INFO] 
SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: 
[10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
   [2024-08-07 17:45:19.453] [SimpleClusterManager-0] [INFO] 
SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: 
[10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
   [2024-08-07 17:45:29.453] [SimpleClusterManager-0] [INFO] 
SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: 
[10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
   [2024-08-07 17:45:39.440] [ApplicationManager-0] [INFO] 
ApplicationManager.statusCheck - Start to check status for 0 applications.
   [2024-08-07 17:45:39.452] [SimpleClusterManager-0] [INFO] 
SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: 
[10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
   [2024-08-07 17:45:48.579] [Grpc-8] [INFO] 
CoordinatorGrpcService.getShuffleAssignments - Request of getShuffleAssignments 
for appId[spark-f958a7892647495a8e01ae93384cf382_1723023715224], shuffleId[0], 
partitionNum[1],  partitionNumPerRange[1], replica[1], requiredTags[[ss_v4, 
GRPC]], requiredShuffleServerNumber[-1],faultyServerIds[0]
   [2024-08-07 17:45:48.580] [Grpc-8] [ERROR] 
CoordinatorGrpcService.getShuffleAssignments - Errors on getting shuffle 
assignments for app: spark-f958a7892647495a8e01ae93384cf382_1723023715224, 
shuffleId: 0, partitionNum: 1, partitionNumPerRange: 1, replica: 1, 
requiredTags: [ss_v4, GRPC]
   org.apache.uniffle.common.exception.RssException: There isn't enough shuffle 
servers
        at 
org.apache.uniffle.coordinator.strategy.assignment.PartitionBalanceAssignmentStrategy.assign(PartitionBalanceAssignmentStrategy.java:138)
        at 
org.apache.uniffle.coordinator.CoordinatorGrpcService.getShuffleAssignments(CoordinatorGrpcService.java:139)
        at 
org.apache.uniffle.proto.CoordinatorServerGrpc$MethodHandlers.invoke(CoordinatorServerGrpc.java:1032)
        at 
io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
        at 
io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
        at 
io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
        at 
io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:351)
        at 
io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
        at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
        at 
io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
   [2024-08-07 17:45:49.453] [SimpleClusterManager-0] [INFO] 
SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: 
[10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
   [2024-08-07 17:45:59.453] [SimpleClusterManager-0] [INFO] 
SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: 
[10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
   ```
   
   
   ### Uniffle Engine Log Output
   
   ```logtalk
   SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
explanation.
   SLF4J: Actual binding is of type 
[org.apache.logging.slf4j.Log4jLoggerFactory]
   [2024-08-07 17:34:00.924] [main] [INFO] ShuffleServer.main - Start to init 
shuffle server using config /root/rss-0.9.0-hadoop2.8/conf/server.conf
   [2024-08-07 17:34:00.956] [main] [INFO] RssUtils.getPropertiesFromFile - 
Load config from /root/rss-0.9.0-hadoop2.8/conf/server.conf
   [2024-08-07 17:34:01.006] [main] [INFO] RssUtils.getHostIp - ip 
fe80:0:0:0:ecee:eeff:feee:eeee%calibf4c7ca0288 was filtered, because it don't 
have effective broadcast address
   [2024-08-07 17:34:01.006] [main] [INFO] RssUtils.getHostIp - ip 
fe80:0:0:0:ecee:eeff:feee:eeee%cali7d3a14c3162 was filtered, because it don't 
have effective broadcast address
   [2024-08-07 17:34:01.006] [main] [INFO] RssUtils.getHostIp - ip 
fe80:0:0:0:ecee:eeff:feee:eeee%calib1845bcbf52 was filtered, because it don't 
have effective broadcast address
   [2024-08-07 17:34:01.007] [main] [INFO] RssUtils.getHostIp - ip 
fe80:0:0:0:ecee:eeff:feee:eeee%cali62d5012223a was filtered, because it don't 
have effective broadcast address
   [2024-08-07 17:34:01.007] [main] [INFO] RssUtils.getHostIp - ip 
fe80:0:0:0:ecee:eeff:feee:eeee%calif391f1f84e0 was filtered, because it don't 
have effective broadcast address
   [2024-08-07 17:34:01.007] [main] [INFO] RssUtils.getHostIp - ip 
fe80:0:0:0:80c1:c3ff:fe9b:3992%flannel.1 was filtered, because it don't have 
effective broadcast address
   [2024-08-07 17:34:01.008] 
[org.apache.uniffle.common.util.JvmPauseMonitor$Monitor@59402b8f] [INFO] 
JvmPauseMonitor.run - Starting JVM pause monitor
   [2024-08-07 17:34:01.008] [main] [INFO] RssUtils.getHostIp - ip 10.42.3.0 
was candidate, if there is no better choice, we will choose it
   [2024-08-07 17:34:01.008] [main] [INFO] RssUtils.getHostIp - ip 
fe80:0:0:0:d604:e6ff:fe4f:16ea%bond0 was filtered, because it don't have 
effective broadcast address
   [2024-08-07 17:34:01.009] [main] [INFO] RssUtils.getHostIp - ip 
10.39.215.220 was filtered, because it's not first effect site local address
   [2024-08-07 17:34:01.013] [main] [INFO] ShuffleServer.initServerTags - 
Server tags: [ss_v5, GRPC]
   [2024-08-07 17:34:01.031] [main] [INFO] log.initialized - Logging 
initialized @2319ms
   [2024-08-07 17:34:01.112] [main] [INFO] ShuffleServer.registerMetrics - 
Register metrics
   [2024-08-07 17:34:01.168] [main] [INFO] RPCMetrics.<init> - Init summary 
observe thread pool, core size:2, max size:20, keep alive time:60
   [2024-08-07 17:34:01.175] [main] [INFO] RPCMetrics.<init> - Init summary 
observe thread pool, core size:2, max size:20, keep alive time:60
   [2024-08-07 17:34:01.600] [main] [INFO] LocalStorageManager.<init> - Succeed 
to initialize storage paths: [/storage/disk2/shuffledata, 
/storage/disk3/shuffledata, /storage/disk4/shuffledata, 
/storage/disk5/shuffledata]
   [2024-08-07 17:34:01.608] [main] [INFO] 
CoordinatorClientFactory.createCoordinatorClient - Start to create coordinator 
clients from 10.39.215.218:19999
   [2024-08-07 17:34:02.000] [main] [INFO] CoordinatorGrpcClient.<init> - 
Created CoordinatorGrpcClient, host:10.39.215.218, port:19999, 
maxRetryAttempts:3, usePlaintext:true
   [2024-08-07 17:34:02.001] [main] [INFO] 
CoordinatorClientFactory.createCoordinatorClient - Add coordinator client 
Coordinator grpc client ref to 10.39.215.218:19999
   [2024-08-07 17:34:02.004] [main] [INFO] 
CoordinatorClientFactory.createCoordinatorClient - Finish create coordinator 
clients Coordinator grpc client ref to 10.39.215.218:19999
   [2024-08-07 17:34:02.057] [main] [INFO] 
DefaultFlushEventHandler.createFlushEventExecutor - CreateFlushPool, 
poolSize:10, keepAliveTime:5, queueSize:2147483647
   [2024-08-07 17:34:02.058] [main] [INFO] 
DefaultFlushEventHandler.createFlushEventExecutor - CreateFlushPool, 
poolSize:5, keepAliveTime:5, queueSize:2147483647
   [2024-08-07 17:34:02.061] [main] [INFO] ShuffleBufferManager.<init> - Init 
shuffle buffer manager with capacity: 42949672960, read buffer capacity: 
21474836480.
   [2024-08-07 17:34:02.103] [main] [INFO] 
TopNShuffleDataSizeOfAppCalcTask.start - TopNShuffleDataSizeOfAppCalcTask start 
schedule.
   [2024-08-07 17:34:02.121] [main] [INFO] Server.doStart - 
jetty-9.3.24.v20180605, build timestamp: 2018-06-06T01:11:56+08:00, git hash: 
84205aa28f11a4f31f2a3b86d1bba2cc8ab69827
   [2024-08-07 17:34:02.156] [main] [INFO] ContextHandler.doStart - Started 
o.e.j.s.ServletContextHandler@3113a37{/,null,AVAILABLE}
   [2024-08-07 17:34:02.176] [main] [INFO] AbstractConnector.doStart - Started 
ServerConnector@68f644d1{HTTP/1.1,[http/1.1]}{0.0.0.0:19996}
   [2024-08-07 17:34:02.176] [main] [INFO] Server.doStart - Started @3466ms
   [2024-08-07 17:34:02.176] [main] [INFO] JettyServer.start - Jetty http 
server started, listening on port 19996
   [2024-08-07 17:34:02.516] [main] [INFO] GrpcServer.startOnPort - Grpc server 
started, configured port: 19997, listening on 19997.
   [2024-08-07 17:34:02.516] [main] [INFO] ShuffleServer.start - Start to 
shuffle server with id 10.42.3.0-19997
   [2024-08-07 17:34:02.518] [main] [INFO] RegisterHeartBeat.startHeartBeat - 
Start heartbeat to coordinator 10.39.215.218:19999 after 10000ms and interval 
is 10000ms
   [2024-08-07 17:34:02.520] [main] [INFO] NettyDirectMemoryTracker.start - 
Start report direct memory usage to MetricSystem after 10000ms and interval is 
10000ms
   [2024-08-07 17:34:02.522] [main] [INFO] ShuffleServer.start - Shuffle server 
start successfully!
   ```
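
One detail stands out when the two logs above are read side by side: the failing request asks for `requiredTags: [ss_v4, GRPC]`, while the shuffle server at `10.42.3.0-19997` starts with `Server tags: [ss_v5, GRPC]`. The snippet below is a minimal sketch of how such a mismatch could leave zero candidates even with three alive servers. It assumes the coordinator keeps only servers whose tag set contains every required tag, and that the other two servers carry the same tags as `10.42.3.0-19997`; neither assumption is confirmed by the logs. `TagMatchSketch`, `isEligible` and `eligibleServers` are hypothetical names for illustration and are not Uniffle's code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; this is not the Uniffle coordinator implementation.
class TagMatchSketch {

    // Assumed rule: a server qualifies only if it carries every required tag.
    static boolean isEligible(Set<String> serverTags, Set<String> requiredTags) {
        return serverTags.containsAll(requiredTags);
    }

    // Returns the ids of the servers that satisfy all required tags, in map order.
    static List<String> eligibleServers(Map<String, Set<String>> servers,
                                        Set<String> requiredTags) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : servers.entrySet()) {
            if (isEligible(entry.getValue(), requiredTags)) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Tags taken from the shuffle server log; assumed identical on all three servers.
        Set<String> serverTags = new HashSet<>(Arrays.asList("ss_v5", "GRPC"));
        Map<String, Set<String>> servers = new LinkedHashMap<>();
        servers.put("10.42.1.0-19997", serverTags);
        servers.put("10.42.0.0-19997", serverTags);
        servers.put("10.42.3.0-19997", serverTags);

        // Tags taken from the failing getShuffleAssignments request.
        Set<String> requested = new HashSet<>(Arrays.asList("ss_v4", "GRPC"));
        System.out.println(eligibleServers(servers, requested)); // prints []

        // With matching version tags, all three servers would qualify.
        Set<String> matching = new HashSet<>(Arrays.asList("ss_v5", "GRPC"));
        System.out.println(eligibleServers(servers, matching).size()); // prints 3
    }
}
```

If this reading is right, the client side and the shuffle servers may be on different versions, which would be worth checking before looking at `rss.coordinator.shuffle.nodes.max` or the exclude-nodes file.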
   
   
   ### Uniffle Server Configurations
   
   ```yaml
   rss.rpc.server.port 19999
   rss.jetty.http.port 19998
   rss.coordinator.server.heartbeat.timeout 30000
   rss.coordinator.app.expired 60000
   rss.coordinator.shuffle.nodes.max 3
   #rss.coordinator.exclude.nodes.file.path file:///xxx
   rss.coordinator.select.partition.strategy CONTINUOUS
   rss.coordinator.dynamicClientConf.enabled false
   rss.coordinator.exclude.nodes.file.path /root/rss-0.9.0-hadoop2.8/conf/exclude_nodes
   ```
   
   
   ### Uniffle Engine Configurations
   
   ```yaml
   rss.rpc.server.port 19997
   rss.jetty.http.port 19996
   rss.storage.basePath /storage/disk2/shuffledata,/storage/disk3/shuffledata,/storage/disk4/shuffledata,/storage/disk5/shuffledata
   rss.storage.type MEMORY_LOCALFILE
   rss.coordinator.quorum 10.39.215.218:19999
   rss.server.buffer.capacity 40gb
   rss.server.read.buffer.capacity 20gb
   rss.server.flush.thread.alive 5
   rss.server.flush.localfile.threadPool.size 10
   rss.server.flush.hadoop.threadPool.size 60
   rss.server.disk.capacity 1t
   rss.server.single.buffer.flush.enabled true
   rss.server.single.buffer.flush.threshold 128m
   
   rss.server.heartbeat.interval 10000
   rss.rpc.message.max.size 1073741824
   rss.server.preAllocation.expired 120000
   rss.server.commit.timeout 600000
   rss.server.app.expired.withoutHeartbeat 120000
   ```
   
   
   ### Additional context
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

