yangbin09 opened a new issue, #3075:
URL: https://github.com/apache/celeborn/issues/3075
---
### Description:
After starting the Celeborn Worker, the following issues occur:
1. **Worker not getting resources (Slots)**
The logs show:
```
maxSlots: 0
activeSlots: 0
```
This indicates that the Worker is reporting zero slots, so no shuffle
work can be assigned to it.
2. **Incorrect disk space information**
In the heartbeat message sent from the Worker to the Master, the disk
space shows:
```
usableSpace: 8.0 EiB
totalSpace: 0.0 B
```
These values are implausible, since usable space cannot exceed a total of
zero. This suggests that the Worker is not detecting the disk space
correctly, which may indicate a problem with the storage path or mount.
3. **No running Shuffle tasks**
The logs show:
```
committed shuffles: 0
running applications: 0
```
This indicates that no tasks are being submitted to the Worker, likely
due to insufficient resources or failed shuffle allocation.
4. **Memory management is normal, but no Shuffle operations are taking place**
Even though memory usage is normal:
```
Direct memory usage: 4.0 MiB/1024.0 MiB
```
No shuffle operations are occurring due to the lack of available
resources.
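As a side note on the disk space values in item 2: `8.0 EiB` is what a binary-unit formatter prints for 2^63 - 1 bytes, the largest signed 64-bit integer, so the figure may be a placeholder or default rather than a measured size. The sketch below only illustrates that arithmetic. `format_bytes` is a hypothetical helper written for this report, not Celeborn's own formatting code, and whether Celeborn actually uses such a default for this entry is an assumption.

```python
# Hypothetical helper for illustration only; this is NOT Celeborn's formatter.
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def format_bytes(n):
    """Render a byte count with binary (1024-based) units and one decimal."""
    value = float(n)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024.0 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024.0


# Largest signed 64-bit integer, e.g. Long.MAX_VALUE on the JVM.
LONG_MAX = 2**63 - 1

print(format_bytes(LONG_MAX))       # usableSpace as reported in the heartbeat
print(format_bytes(0))              # totalSpace as reported in the heartbeat
print(format_bytes(4 * 1024**2))    # direct memory usage from the log
```

If the placeholder reading is right, the `usableSpace` value carries no information about the real storage, and the `totalSpace` of zero is the more telling field.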
### Steps to Reproduce:
1. Start the Celeborn Worker.
2. Start a Spark job that performs a shuffle operation.
3. Check the logs of the Celeborn Worker, paying attention to the
`maxSlots`, `activeSlots`, `usableSpace`, `totalSpace`, and
`committed shuffles` fields.
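To make step 3 easier to repeat, the fields can be pulled out of a worker log with a small script. This is a minimal sketch, assuming the `DiskInfo(...)` text has the same layout as in the heartbeat message from this report. The function name `extract_disk_fields` and the regular expressions are my own and are not part of Celeborn.

```python
import re

# Sample text copied from the DiskInfo portion of the heartbeat in this report.
SAMPLE = (
    "DiskInfo(maxSlots: 0, committed shuffles 0, running applications 0, "
    "shuffleAllocations: Map(), mountPoint: HDFS, usableSpace: 8.0 EiB, "
    "totalSpace: 0.0 B , avgFlushTime: 999999 ns, avgFetchTime: 999999 ns, "
    "activeSlots: 0, storageType: HDFS)"
)

# One pattern per field of interest; the layout is assumed from the log above.
FIELD_PATTERNS = {
    "maxSlots": r"maxSlots: (\d+)",
    "activeSlots": r"activeSlots: (\d+)",
    "committed shuffles": r"committed shuffles:? (\d+)",
    "usableSpace": r"usableSpace: ([\d.]+ \w+)",
    "totalSpace": r"totalSpace: ([\d.]+ \w+)",
}


def extract_disk_fields(text):
    """Return the fields of interest found in a DiskInfo log fragment."""
    # Collapse whitespace so fields wrapped across lines still match.
    flat = " ".join(text.split())
    result = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, flat)
        result[name] = match.group(1) if match else None
    return result


fields = extract_disk_fields(SAMPLE)
for name, value in fields.items():
    print(f"{name}: {value}")
```

A field that comes back as `None` means the pattern did not match, which would suggest the log layout differs from the one assumed here.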
### Log Summary:
Here are the relevant log entries:
```
25/01/21 16:57:32,635 DEBUG [worker-disk-checker] LocalDeviceMonitor: Device
check start
25/01/21 16:57:34,108 INFO [worker-memory-manager-reporter] MemoryManager:
Direct memory usage: 4.0 MiB/1024.0 MiB, disk buffer size: 0.0 B, sort memory
size: 0.0 B, read buffer size: 0.0 B, memory file storage size: 0.0 B
25/01/21 16:57:37,897 INFO [worker-forward-message-scheduler]
StorageManager: Updated diskInfos:
25/01/21 16:57:37,898 DEBUG [worker-forward-message-scheduler] MasterClient:
Send rpc message HeartbeatFromWorker(21.102.91.135,38565,35711,40333,42329,
Stream(DiskInfo(maxSlots: 0, committed shuffles 0, running applications 0,
shuffleAllocations: Map(), mountPoint: HDFS, usableSpace: 8.0 EiB, totalSpace:
0.0 B , avgFlushTime: 999999 ns, avgFetchTime: 999999 ns, activeSlots: 0,
storageType: HDFS) status: HEALTHY dirs , ?),{},[],{},
false,WorkerStatus{state= Normal,
stateStartTime=1737448232150},688489b2-4a4b-4840-bb9f-78bfd46031b8#54)
```