GoodJeek opened a new issue, #2020: URL: https://github.com/apache/incubator-uniffle/issues/2020
### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)

### Search before asking

- [X] I have searched in the [issues](https://github.com/apache/incubator-uniffle/issues?q=is%3Aissue) and found no similar issues.

### Describe the bug

```
[2024-08-07 16:43:19.813] [Grpc-166] [INFO] CoordinatorGrpcService.getShuffleAssignments - Request of getShuffleAssignments for appId[spark-82104866780a4adab8b4bedc7616459e_1723016966711], shuffleId[8], partitionNum[1], partitionNumPerRange[1], replica[1], requiredTags[[ss_v4, GRPC]], requiredShuffleServerNumber[-1],faultyServerIds[0]
[2024-08-07 16:43:19.813] [Grpc-166] [ERROR] CoordinatorGrpcService.getShuffleAssignments - Errors on getting shuffle assignments for app: spark-82104866780a4adab8b4bedc7616459e_1723016966711, shuffleId: 8, partitionNum: 1, partitionNumPerRange: 1, replica: 1, requiredTags: [ss_v4, GRPC]
org.apache.uniffle.common.exception.RssException: There isn't enough shuffle servers
    at org.apache.uniffle.coordinator.strategy.assignment.PartitionBalanceAssignmentStrategy.assign(PartitionBalanceAssignmentStrategy.java:138)
    at org.apache.uniffle.coordinator.CoordinatorGrpcService.getShuffleAssignments(CoordinatorGrpcService.java:139)
    at org.apache.uniffle.proto.CoordinatorServerGrpc$MethodHandlers.invoke(CoordinatorServerGrpc.java:1032)
    at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
    at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
    at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:351)
    at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
[2024-08-07 16:43:41.259] [ApplicationManager-0] [INFO] ApplicationManager.statusCheck - Start to check status for 0 applications.
[2024-08-07 16:44:11.259] [ApplicationManager-0] [INFO] ApplicationManager.statusCheck - Start to check status for 0 applications.
[2024-08-07 16:44:41.259] [ApplicationManager-0] [INFO] ApplicationManager.statusCheck - Start to check status for 0 applications.
[2024-08-07 16:45:11.259] [ApplicationManager-0] [INFO] ApplicationManager.statusCheck - Start to check status for 0 applications.
[2024-08-07 16:45:21.271] [SimpleClusterManager-0] [INFO] SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: [10.42.1.0-19999, 10.42.3.0-19999, 10.42.0.0-19999]
```
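One thing that stands out when reading the logs together: the assignment requests arrive with `requiredTags[[ss_v4, GRPC]]`, while the shuffle servers register with `Server tags: [ss_v5, GRPC]` (see `ShuffleServer.initServerTags` in the Uniffle Engine Log Output below). The snippet below is only a minimal, hypothetical sketch (the class and method names are made up; it is not the actual `PartitionBalanceAssignmentStrategy` code) of how a filter that requires every requested tag to be present on a server can leave zero candidates even though `SimpleClusterManager` reports three alive nodes:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Minimal, hypothetical sketch, not the real assignment strategy: it only illustrates
// that requiring every requested tag can yield zero candidates despite 3 alive servers.
public class TagFilterSketch {

  // Return the ids of servers whose tag set contains all required tags.
  static List<String> eligibleServers(Map<String, Set<String>> serverTags,
                                      Set<String> requiredTags) {
    return serverTags.entrySet().stream()
        .filter(e -> e.getValue().containsAll(requiredTags))
        .map(Map.Entry::getKey)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Tags as they appear in the logs above: the shuffle servers register [ss_v5, GRPC] ...
    Map<String, Set<String>> serverTags = Map.of(
        "10.42.1.0-19997", Set.of("ss_v5", "GRPC"),
        "10.42.3.0-19997", Set.of("ss_v5", "GRPC"),
        "10.42.0.0-19997", Set.of("ss_v5", "GRPC"));

    // ... while the assignment request asks for requiredTags [ss_v4, GRPC].
    Set<String> requiredTags = Set.of("ss_v4", "GRPC");

    List<String> candidates = eligibleServers(serverTags, requiredTags);
    System.out.println("Eligible servers: " + candidates); // prints: Eligible servers: []

    if (candidates.isEmpty()) {
      // Zero candidates is the kind of situation that surfaces as the error above.
      System.out.println("There isn't enough shuffle servers");
    }
  }
}
```

If that is what is happening here, `Alive servers number: 3` alone does not guarantee that any of those servers satisfies the tags requested by the client.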
### Affects Version(s)

rss-0.9.0

### Uniffle Server Log Output

```logtalk
[2024-08-07 17:44:49.453] [SimpleClusterManager-0] [INFO] SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: [10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
[2024-08-07 17:44:59.453] [SimpleClusterManager-0] [INFO] SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: [10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
[2024-08-07 17:45:09.440] [ApplicationManager-0] [INFO] ApplicationManager.statusCheck - Start to check status for 0 applications.
[2024-08-07 17:45:09.452] [SimpleClusterManager-0] [INFO] SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: [10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
[2024-08-07 17:45:19.453] [SimpleClusterManager-0] [INFO] SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: [10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
[2024-08-07 17:45:29.453] [SimpleClusterManager-0] [INFO] SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: [10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
[2024-08-07 17:45:39.440] [ApplicationManager-0] [INFO] ApplicationManager.statusCheck - Start to check status for 0 applications.
[2024-08-07 17:45:39.452] [SimpleClusterManager-0] [INFO] SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: [10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
[2024-08-07 17:45:48.579] [Grpc-8] [INFO] CoordinatorGrpcService.getShuffleAssignments - Request of getShuffleAssignments for appId[spark-f958a7892647495a8e01ae93384cf382_1723023715224], shuffleId[0], partitionNum[1], partitionNumPerRange[1], replica[1], requiredTags[[ss_v4, GRPC]], requiredShuffleServerNumber[-1],faultyServerIds[0]
[2024-08-07 17:45:48.580] [Grpc-8] [ERROR] CoordinatorGrpcService.getShuffleAssignments - Errors on getting shuffle assignments for app: spark-f958a7892647495a8e01ae93384cf382_1723023715224, shuffleId: 0, partitionNum: 1, partitionNumPerRange: 1, replica: 1, requiredTags: [ss_v4, GRPC]
org.apache.uniffle.common.exception.RssException: There isn't enough shuffle servers
    at org.apache.uniffle.coordinator.strategy.assignment.PartitionBalanceAssignmentStrategy.assign(PartitionBalanceAssignmentStrategy.java:138)
    at org.apache.uniffle.coordinator.CoordinatorGrpcService.getShuffleAssignments(CoordinatorGrpcService.java:139)
    at org.apache.uniffle.proto.CoordinatorServerGrpc$MethodHandlers.invoke(CoordinatorServerGrpc.java:1032)
    at io.grpc.stub.ServerCalls$UnaryServerCallHandler$UnaryServerCallListener.onHalfClose(ServerCalls.java:182)
    at io.grpc.PartialForwardingServerCallListener.onHalfClose(PartialForwardingServerCallListener.java:35)
    at io.grpc.ForwardingServerCallListener.onHalfClose(ForwardingServerCallListener.java:23)
    at io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.halfClosed(ServerCallImpl.java:351)
    at io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1HalfClosed.runInContext(ServerImpl.java:861)
    at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
    at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
[2024-08-07 17:45:49.453] [SimpleClusterManager-0] [INFO] SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: [10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
[2024-08-07 17:45:59.453] [SimpleClusterManager-0] [INFO] SimpleClusterManager.nodesCheck - Alive servers number: 3, ids: [10.42.1.0-19997, 10.42.0.0-19997, 10.42.3.0-19997]
```

### Uniffle Engine Log Output

```logtalk
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
[2024-08-07 17:34:00.924] [main] [INFO] ShuffleServer.main - Start to init shuffle server using config /root/rss-0.9.0-hadoop2.8/conf/server.conf
[2024-08-07 17:34:00.956] [main] [INFO] RssUtils.getPropertiesFromFile - Load config from /root/rss-0.9.0-hadoop2.8/conf/server.conf
[2024-08-07 17:34:01.006] [main] [INFO] RssUtils.getHostIp - ip fe80:0:0:0:ecee:eeff:feee:eeee%calibf4c7ca0288 was filtered, because it don't have effective broadcast address
[2024-08-07 17:34:01.006] [main] [INFO] RssUtils.getHostIp - ip fe80:0:0:0:ecee:eeff:feee:eeee%cali7d3a14c3162 was filtered, because it don't have effective broadcast address
[2024-08-07 17:34:01.006] [main] [INFO] RssUtils.getHostIp - ip fe80:0:0:0:ecee:eeff:feee:eeee%calib1845bcbf52 was filtered, because it don't have effective broadcast address
[2024-08-07 17:34:01.007] [main] [INFO] RssUtils.getHostIp - ip fe80:0:0:0:ecee:eeff:feee:eeee%cali62d5012223a was filtered, because it don't have effective broadcast address
[2024-08-07 17:34:01.007] [main] [INFO] RssUtils.getHostIp - ip fe80:0:0:0:ecee:eeff:feee:eeee%calif391f1f84e0 was filtered, because it don't have effective broadcast address
[2024-08-07 17:34:01.007] [main] [INFO] RssUtils.getHostIp - ip fe80:0:0:0:80c1:c3ff:fe9b:3992%flannel.1 was filtered, because it don't have effective broadcast address
[2024-08-07 17:34:01.008] [org.apache.uniffle.common.util.JvmPauseMonitor$Monitor@59402b8f] [INFO] JvmPauseMonitor.run - Starting JVM pause monitor
[2024-08-07 17:34:01.008] [main] [INFO] RssUtils.getHostIp - ip 10.42.3.0 was candidate, if there is no better choice, we will choose it
[2024-08-07 17:34:01.008] [main] [INFO] RssUtils.getHostIp - ip fe80:0:0:0:d604:e6ff:fe4f:16ea%bond0 was filtered, because it don't have effective broadcast address
[2024-08-07 17:34:01.009] [main] [INFO] RssUtils.getHostIp - ip 10.39.215.220 was filtered, because it's not first effect site local address
[2024-08-07 17:34:01.013] [main] [INFO] ShuffleServer.initServerTags - Server tags: [ss_v5, GRPC]
[2024-08-07 17:34:01.031] [main] [INFO] log.initialized - Logging initialized @2319ms
[2024-08-07 17:34:01.112] [main] [INFO] ShuffleServer.registerMetrics - Register metrics
[2024-08-07 17:34:01.168] [main] [INFO] RPCMetrics.<init> - Init summary observe thread pool, core size:2, max size:20, keep alive time:60
[2024-08-07 17:34:01.175] [main] [INFO] RPCMetrics.<init> - Init summary observe thread pool, core size:2, max size:20, keep alive time:60
[2024-08-07 17:34:01.600] [main] [INFO] LocalStorageManager.<init> - Succeed to initialize storage paths: [/storage/disk2/shuffledata, /storage/disk3/shuffledata, /storage/disk4/shuffledata, /storage/disk5/shuffledata]
[2024-08-07 17:34:01.608] [main] [INFO] CoordinatorClientFactory.createCoordinatorClient - Start to create coordinator clients from 10.39.215.218:19999
[2024-08-07 17:34:02.000] [main] [INFO] CoordinatorGrpcClient.<init> - Created CoordinatorGrpcClient, host:10.39.215.218, port:19999, maxRetryAttempts:3, usePlaintext:true
[2024-08-07 17:34:02.001] [main] [INFO] CoordinatorClientFactory.createCoordinatorClient - Add coordinator client Coordinator grpc client ref to 10.39.215.218:19999
[2024-08-07 17:34:02.004] [main] [INFO] CoordinatorClientFactory.createCoordinatorClient - Finish create coordinator clients Coordinator grpc client ref to 10.39.215.218:19999
[2024-08-07 17:34:02.057] [main] [INFO] DefaultFlushEventHandler.createFlushEventExecutor - CreateFlushPool, poolSize:10, keepAliveTime:5, queueSize:2147483647
[2024-08-07 17:34:02.058] [main] [INFO] DefaultFlushEventHandler.createFlushEventExecutor - CreateFlushPool, poolSize:5, keepAliveTime:5, queueSize:2147483647
[2024-08-07 17:34:02.061] [main] [INFO] ShuffleBufferManager.<init> - Init shuffle buffer manager with capacity: 42949672960, read buffer capacity: 21474836480.
[2024-08-07 17:34:02.103] [main] [INFO] TopNShuffleDataSizeOfAppCalcTask.start - TopNShuffleDataSizeOfAppCalcTask start schedule.
[2024-08-07 17:34:02.121] [main] [INFO] Server.doStart - jetty-9.3.24.v20180605, build timestamp: 2018-06-06T01:11:56+08:00, git hash: 84205aa28f11a4f31f2a3b86d1bba2cc8ab69827
[2024-08-07 17:34:02.156] [main] [INFO] ContextHandler.doStart - Started o.e.j.s.ServletContextHandler@3113a37{/,null,AVAILABLE}
[2024-08-07 17:34:02.176] [main] [INFO] AbstractConnector.doStart - Started ServerConnector@68f644d1{HTTP/1.1,[http/1.1]}{0.0.0.0:19996}
[2024-08-07 17:34:02.176] [main] [INFO] Server.doStart - Started @3466ms
[2024-08-07 17:34:02.176] [main] [INFO] JettyServer.start - Jetty http server started, listening on port 19996
[2024-08-07 17:34:02.516] [main] [INFO] GrpcServer.startOnPort - Grpc server started, configured port: 19997, listening on 19997.
[2024-08-07 17:34:02.516] [main] [INFO] ShuffleServer.start - Start to shuffle server with id 10.42.3.0-19997
[2024-08-07 17:34:02.518] [main] [INFO] RegisterHeartBeat.startHeartBeat - Start heartbeat to coordinator 10.39.215.218:19999 after 10000ms and interval is 10000ms
[2024-08-07 17:34:02.520] [main] [INFO] NettyDirectMemoryTracker.start - Start report direct memory usage to MetricSystem after 10000ms and interval is 10000ms
[2024-08-07 17:34:02.522] [main] [INFO] ShuffleServer.start - Shuffle server start successfully!
```

### Uniffle Server Configurations

```yaml
rss.rpc.server.port 19999
rss.jetty.http.port 19998
rss.coordinator.server.heartbeat.timeout 30000
rss.coordinator.app.expired 60000
rss.coordinator.shuffle.nodes.max 3
#rss.coordinator.exclude.nodes.file.path file:///xxx
rss.coordinator.select.partition.strategy CONTINUOUS
rss.coordinator.dynamicClientConf.enabled false
rss.coordinator.exclude.nodes.file.path /root/rss-0.9.0-hadoop2.8/conf/exclude_nodes
```

### Uniffle Engine Configurations

```yaml
rss.rpc.server.port 19997
rss.jetty.http.port 19996
rss.storage.basePath /storage/disk2/shuffledata,/storage/disk3/shuffledata,/storage/disk4/shuffledata,/storage/disk5/shuffledata
rss.storage.type MEMORY_LOCALFILE
rss.coordinator.quorum 10.39.215.218:19999
rss.server.buffer.capacity 40gb
rss.server.read.buffer.capacity 20gb
rss.server.flush.thread.alive 5
rss.server.flush.localfile.threadPool.size 10
rss.server.flush.hadoop.threadPool.size 60
rss.server.disk.capacity 1t
rss.server.single.buffer.flush.enabled true
rss.server.single.buffer.flush.threshold 128m
rss.server.heartbeat.interval 10000
rss.rpc.message.max.size 1073741824
rss.server.preAllocation.expired 120000
rss.server.commit.timeout 600000
rss.server.app.expired.withoutHeartbeat 120000
```

### Additional context

_No response_

### Are you willing to submit PR?

- [X] Yes I am willing to submit a PR!
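The coordinator configuration above also points at an exclude list: `rss.coordinator.exclude.nodes.file.path /root/rss-0.9.0-hadoop2.8/conf/exclude_nodes`. If, as the key name suggests, servers listed there are removed from assignment, an over-broad exclude file could produce the same error. The toy check below is only a sketch under assumptions this report does not confirm: one server id per line, `#` starting a comment, and excluded ids simply dropped from the alive list:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Toy sanity check, not Uniffle code: given the alive server ids from the coordinator log
// and the exclude-nodes file referenced in the configuration above, how many ids remain?
public class ExcludeNodesCheck {
  public static void main(String[] args) throws IOException {
    // Ids as reported by SimpleClusterManager.nodesCheck in the coordinator log.
    List<String> alive = List.of("10.42.1.0-19997", "10.42.0.0-19997", "10.42.3.0-19997");

    // Assumption: one server id per line, '#' starts a comment.
    Set<String> excluded = Files.readAllLines(
            Path.of("/root/rss-0.9.0-hadoop2.8/conf/exclude_nodes")).stream()
        .map(String::trim)
        .filter(line -> !line.isEmpty() && !line.startsWith("#"))
        .collect(Collectors.toSet());

    List<String> remaining = alive.stream()
        .filter(id -> !excluded.contains(id))
        .collect(Collectors.toList());

    // With only three servers alive, excluding even one leaves little headroom.
    System.out.println("Servers left after applying the exclude list: " + remaining);
  }
}
```

Checking that file only takes a moment and rules out one more variable besides the tag values discussed above.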
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
