[I] [BUG] title Worker not getting resources (Slots) [celeborn]

via GitHub Tue, 21 Jan 2025 01:31:15 -0800


yangbin09 opened a new issue, #3075:
URL: https://github.com/apache/celeborn/issues/3075


   ---
   
   ### Description:
   
   After starting the Celeborn Worker, the following issues occur:
   
   1. **Worker not getting resources (Slots)**
      The logs show:
      ```
      maxSlots: 0
      activeSlots: 0
      ```
      This indicates that the Worker is not being allocated any resources, 
preventing it from executing any tasks.
   
   2. **Incorrect disk space information**
      In the heartbeat message sent from the Worker to the Master, the disk 
space shows:
      ```
      usableSpace: 8.0 EiB
      totalSpace: 0.0 B
      ```
      This suggests that the Worker is not correctly detecting the disk space, 
which may indicate a problem with the storage path or mount.
   
   3. **No running Shuffle tasks**
      The logs show:
      ```
      committed shuffles: 0
      running applications: 0
      ```
      This indicates that no tasks are being submitted to the Worker, likely 
due to insufficient resources or failed shuffle allocation.
   
   4. **Memory management is normal, but no Shuffle operations are taking place**
      Even though memory usage is normal:
      ```
      Direct memory usage: 4.0 MiB/1024.0 MiB
      ```
      No shuffle operations are occurring due to the lack of available 
resources.
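
A note on the disk-space figures in item 2: `8.0 EiB` is what `java.lang.Long.MAX_VALUE` bytes (2^63 - 1) looks like when rendered in binary units. The sketch below only demonstrates that arithmetic. Whether the Worker reports this value as a placeholder for remote storage (the heartbeat shows `mountPoint: HDFS`) rather than as a mis-detected local disk is an assumption, not something the logs confirm. The helper `format_bytes` is hypothetical and is not Celeborn code.

```python
# Largest signed 64-bit value, i.e. java.lang.Long.MAX_VALUE.
JAVA_LONG_MAX = 2**63 - 1

UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def format_bytes(n: int) -> str:
    """Render a byte count with binary units and one decimal place.

    Hypothetical helper for illustration; it only mimics the style of
    the sizes seen in the Worker log.
    """
    value = float(n)
    for unit in UNITS:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < 1024 or unit == UNITS[-1]:
            return f"{value:.1f} {unit}"
        value /= 1024


if __name__ == "__main__":
    print(format_bytes(JAVA_LONG_MAX))    # usableSpace as seen in the heartbeat
    print(format_bytes(0))                # totalSpace as seen in the heartbeat
    print(format_bytes(4 * 1024 * 1024))  # direct memory in use
```

If the value does turn out to be a placeholder, the `usableSpace`/`totalSpace` pair may say little about local disk health, and the `maxSlots: 0` symptom would need to be investigated separately.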
   
   ### Steps to Reproduce:
   
   1. Start the Celeborn Worker.
   2. Start a Spark job that performs a shuffle operation.
   3. Check the logs of the Celeborn Worker, paying attention to the `maxSlots`, `activeSlots`, `usableSpace`, `totalSpace`, and `committed shuffles` fields.
   
   ### Log Summary:
   
   Here are the relevant log entries:
   ```
   25/01/21 16:57:32,635 DEBUG [worker-disk-checker] LocalDeviceMonitor: Device check start
   25/01/21 16:57:34,108 INFO [worker-memory-manager-reporter] MemoryManager: Direct memory usage: 4.0 MiB/1024.0 MiB, disk buffer size: 0.0 B, sort memory size: 0.0 B, read buffer size: 0.0 B, memory file storage size: 0.0 B
   25/01/21 16:57:37,897 INFO [worker-forward-message-scheduler] StorageManager: Updated diskInfos:
   25/01/21 16:57:37,898 DEBUG [worker-forward-message-scheduler] MasterClient: Send rpc message HeartbeatFromWorker(21.102.91.135,38565,35711,40333,42329, Stream(DiskInfo(maxSlots: 0, committed shuffles 0, running applications 0, shuffleAllocations: Map(), mountPoint: HDFS, usableSpace: 8.0 EiB, totalSpace: 0.0 B , avgFlushTime: 999999 ns, avgFetchTime: 999999 ns, activeSlots: 0, storageType: HDFS) status: HEALTHY dirs , ?),{},[],{}, false,WorkerStatus{state= Normal, stateStartTime=1737448232150},688489b2-4a4b-4840-bb9f-78bfd46031b8#54)
   ```
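
To make the fields named in the reproduction steps easier to compare across heartbeats, the `DiskInfo(...)` portion of the log line above can be pulled apart with a small script. This is a minimal sketch: the regular expression and the function `parse_disk_info` are assumptions written against the single log line in this report, and other Celeborn versions may print the fields differently.

```python
import re

# DiskInfo text copied from the HeartbeatFromWorker line in this report.
SAMPLE = (
    "DiskInfo(maxSlots: 0, committed shuffles 0, running applications 0, "
    "shuffleAllocations: Map(), mountPoint: HDFS, usableSpace: 8.0 EiB, "
    "totalSpace: 0.0 B , avgFlushTime: 999999 ns, avgFetchTime: 999999 ns, "
    "activeSlots: 0, storageType: HDFS) status: HEALTHY dirs "
)

# Field names of interest; some are printed with a colon and some without.
FIELD = re.compile(
    r"(maxSlots|activeSlots|committed shuffles|running applications|"
    r"mountPoint|usableSpace|totalSpace|storageType):?\s+([^,)]+)"
)


def parse_disk_info(text: str) -> dict:
    """Return the selected DiskInfo fields as a name -> string value mapping."""
    return {name: value.strip() for name, value in FIELD.findall(text)}


if __name__ == "__main__":
    for name, value in parse_disk_info(SAMPLE).items():
        print(f"{name} = {value}")
```

Values are kept as strings so that sizes such as `8.0 EiB` stay exactly as logged.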
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
