turboFei commented on code in PR #2623:
URL: https://github.com/apache/celeborn/pull/2623#discussion_r1678767729


##########
docs/monitoring.md:
##########
@@ -127,181 +131,171 @@ These metrics are exposed by Celeborn master.
     - **notes:**
         - This metrics data is generated for each user and they are identified using a metric tag.
         - This metrics also include subResourceConsumptions generated for each application of user and they are identified using `applicationId` tag.
-    - diskFileCount
-    - diskBytesWritten
-    - hdfsFileCount
-    - hdfsBytesWritten
+    
+    | Metric Name       | Description                                         |
+    |-------------------|-----------------------------------------------------|
+    | diskFileCount     | The count of disk files consumption by each user.   |
+    | diskBytesWritten  | The amount of disk files consumption by each user.  |
+    | hdfsFileCount     | The count of hdfs files consumption by each user.   |
+    | hdfsBytesWritten  | The amount of hdfs files consumption by each user.  |
 
   - namespace=ThreadPool
     - **notes:**
         - This metrics data is generated for each thread pool and they are identified using a metric tag by thread pool name.
-    - active_thread_count
-    - pending_task_count
-    - pool_size
-    - core_pool_size
-    - maximum_pool_size
-    - largest_pool_size
-    - is_terminating
-    - is_terminated
-    - is_shutdown
-    - thread_count
-    - thread_is_terminated_count
-    - thread_is_shutdown_count
+    
+    | Metric Name                  | Description |
+    |------------------------------|-----------------------------------------------------------------------------------------------------------------------------|
+    | active_thread_count          | The approximate number of threads that are actively executing tasks. |
+    | pending_task_count           | The pending task not executed in block queue. |
+    | pool_size                    | The current number of threads in the pool. |
+    | core_pool_size               | The core number of threads. |
+    | maximum_pool_size            | The maximum allowed number of threads. |
+    | largest_pool_size            | The largest number of threads that have ever simultaneously been in the pool. |
+    | is_terminating               | Whether this executor is in the process of terminating after shutdown() or shutdownNow() but has not completely terminated. |
+    | is_terminated                | Whether this executor is in the process of terminated after shutdown() or shutdownNow() and has completely terminated. |
+    | is_shutdown                  | Whether this executor is shutdown. |
+    | thread_count                 | The thread count of current thread group. |
+    | thread_is_terminated_count   | The terminated thread count of current thread group. |
+    | thread_is_shutdown_count     | The shutdown thread count of current thread group. |
 
 #### Worker
 These metrics are exposed by Celeborn worker.
 
   - namespace=worker
-    - RegisteredShuffleCount
-    - RunningApplicationCount
-    - ActiveShuffleSize
-        - The active shuffle size of a worker including master replica and slave replica.
-    - ActiveShuffleFileCount
-        - The active shuffle file count of a worker including master replica and slave replica.
-    - OpenStreamTime
-        - The time for a worker to process openStream RPC and return StreamHandle.
-    - FetchChunkTime
-        - The time for a worker to fetch a chunk which is 8MB by default from a reduced partition.
-    - ActiveChunkStreamCount
-        - Active stream count for reduce partition reading streams.
-    - OpenStreamSuccessCount
-    - OpenStreamFailCount
-    - FetchChunkSuccessCount
-    - FetchChunkFailCount
-    - PrimaryPushDataTime
-        - The time for a worker to handle a pushData RPC sent from a celeborn client.
-    - ReplicaPushDataTime
-        - The time for a worker to handle a pushData RPC sent from a celeborn worker by replicating.
-    - WriteDataHardSplitCount
-    - WriteDataSuccessCount
-    - WriteDataFailCount
-    - ReplicateDataFailCount
-    - ReplicateDataWriteFailCount
-    - ReplicateDataCreateConnectionFailCount
-    - ReplicateDataConnectionExceptionCount
-    - ReplicateDataFailNonCriticalCauseCount
-    - ReplicateDataTimeoutCount
-    - PushDataHandshakeFailCount
-    - RegionStartFailCount
-    - RegionFinishFailCount
-    - PrimaryPushDataHandshakeTime
-    - ReplicaPushDataHandshakeTime
-    - PrimaryRegionStartTime
-    - ReplicaRegionStartTime
-    - PrimaryRegionFinishTime
-    - ReplicaRegionFinishTime
-    - PausePushDataTime
-        - The time for a worker to stop receiving pushData from clients because of back pressure.
-    - PausePushDataAndReplicateTime
-        - The time for a worker to stop receiving pushData from clients and other workers because of back pressure.
-    - PausePushData
-        - The count for a worker to stop receiving pushData from clients because of back pressure.
-    - PausePushDataAndReplicate
-        - The count for a worker to stop receiving pushData from clients and other workers because of back pressure.
-    - TakeBufferTime
-        - The time for a worker to take out a buffer from a disk flusher.
-    - FlushDataTime
-        - The time for a worker to write a buffer which is 256KB by default to storage.
-    - CommitFilesTime
-        - The time for a worker to flush buffers and close files related to specified shuffle.
-    - SlotsAllocated
-    - ActiveSlotsCount
-        - The number of slots currently being used in a worker 
-    - ReserveSlotsTime
-    - ActiveConnectionCount
-    - NettyMemory
-        - The total amount of off-heap memory used by celeborn worker.
-    - SortTime
-        - The time for a worker to sort a shuffle file.
-    - SortMemory
-        - The memory used by sorting shuffle files.
-    - SortingFiles
-    - SortedFiles
-    - SortedFileSize
-    - DiskBuffer
-        - The memory occupied by pushData and pushMergedData which should be written to disk.
-    - BufferStreamReadBuffer
-        - The memory used by credit stream read buffer.
-    - ReadBufferDispatcherRequestsLength
-        - The queue size of read buffer allocation requests.
-    - ReadBufferAllocatedCount
-        - Allocated read buffer count.
-    - ActiveCreditStreamCount
-        - Active stream count for map partition reading streams.
-    - ActiveMapPartitionCount
-    - CleanTaskQueueSize
-    - CleanExpiredShuffleKeysTime
-        - The time for a worker to clean up shuffle data of expired shuffle keys.
-    - DeviceOSFreeBytes
-    - DeviceOSTotalBytes
-    - DeviceCelebornFreeBytes
-    - DeviceCelebornTotalBytes
-    - PotentialConsumeSpeed
-    - UserProduceSpeed
-    - WorkerConsumeSpeed
-    - IsDecommissioningWorker
-    - push_server_usedHeapMemory 
-    - push_server_usedDirectMemory
-    - push_server_numAllocations 
-    - push_server_numTinyAllocations
-    - push_server_numSmallAllocations
-    - push_server_numNormalAllocations
-    - push_server_numHugeAllocations
-    - push_server_numDeallocations
-    - push_server_numTinyDeallocations
-    - push_server_numSmallDeallocations
-    - push_server_numNormalDeallocations
-    - push_server_numHugeDeallocations
-    - push_server_numActiveAllocations
-    - push_server_numActiveTinyAllocations
-    - push_server_numActiveSmallAllocations
-    - push_server_numActiveNormalAllocations
-    - push_server_numActiveHugeAllocations
-    - push_server_numActiveBytes
-    - replicate_server_usedHeapMemory
-    - replicate_server_usedDirectMemory
-    - replicate_server_numAllocations 
-    - replicate_server_numTinyAllocations
-    - replicate_server_numSmallAllocations
-    - replicate_server_numNormalAllocations
-    - replicate_server_numHugeAllocations
-    - replicate_server_numDeallocations
-    - replicate_server_numTinyDeallocations
-    - replicate_server_numSmallDeallocations
-    - replicate_server_numNormalDeallocations
-    - replicate_server_numHugeDeallocations
-    - replicate_server_numActiveAllocations
-    - replicate_server_numActiveTinyAllocations
-    - replicate_server_numActiveSmallAllocations
-    - replicate_server_numActiveNormalAllocations
-    - replicate_server_numActiveHugeAllocations
-    - replicate_server_numActiveBytes
-    - fetch_server_usedHeapMemory
-    - fetch_server_usedDirectMemory
-    - fetch_server_numAllocations 
-    - fetch_server_numTinyAllocations
-    - fetch_server_numSmallAllocations
-    - fetch_server_numNormalAllocations
-    - fetch_server_numHugeAllocations
-    - fetch_server_numDeallocations
-    - fetch_server_numTinyDeallocations
-    - fetch_server_numSmallDeallocations
-    - fetch_server_numNormalDeallocations
-    - fetch_server_numHugeDeallocations
-    - fetch_server_numActiveAllocations
-    - fetch_server_numActiveTinyAllocations
-    - fetch_server_numActiveSmallAllocations
-    - fetch_server_numActiveNormalAllocations
-    - fetch_server_numActiveHugeAllocations
-    - fetch_server_numActiveBytes
+    
+    | Metric Name                                 | Description |
+    |---------------------------------------------|-----------------------------------------------------------------------------------------------------------------|
+    | RegisteredShuffleCount                      | The count of registered shuffle. |
+    | RunningApplicationCount                     | The count of running applications. |
+    | ActiveShuffleSize                           | The active shuffle size of a worker including master replica and slave replica. |
+    | ActiveShuffleFileCount                      | The active shuffle file count of a worker including master replica and slave replica. |
+    | OpenStreamTime                              | The time for a worker to process openStream RPC and return StreamHandle. |
+    | FetchChunkTime                              | The time for a worker to fetch a chunk which is 8MB by default from a reduced partition. |
+    | ActiveChunkStreamCount                      | Active stream count for reduce partition reading streams. |
+    | OpenStreamSuccessCount                      | The count of opening stream succeed in current worker. |
+    | OpenStreamFailCount                         | The count of opening stream failed in current worker. |
+    | FetchChunkSuccessCount                      | The count of fetching chunk succeed in current worker. |
+    | FetchChunkFailCount                         | The count of fetching chunk failed in current worker. |
+    | PrimaryPushDataTime                         | The time for a worker to handle a pushData RPC sent from a celeborn client. |
+    | ReplicaPushDataTime                         | The time for a worker to handle a pushData RPC sent from a celeborn worker by replicating. |
+    | WriteDataHardSplitCount                     | The count of writing PushData or PushMergedData to HARD_SPLIT partition in current worker. |
+    | WriteDataSuccessCount                       | The count of writing PushData or PushMergedData succeed in current worker. |
+    | WriteDataFailCount                          | The count of writing PushData or PushMergedData failed in current worker. |
+    | ReplicateDataFailCount                      | The count of replicating PushData or PushMergedData failed in current worker. |
+    | ReplicateDataWriteFailCount                 | The count of replicating PushData or PushMergedData failed caused by write failure in peer worker. |
+    | ReplicateDataCreateConnectionFailCount      | The count of replicating PushData or PushMergedData failed caused by creating connection failed in peer worker. |
+    | ReplicateDataConnectionExceptionCount       | The count of replicating PushData or PushMergedData failed caused by connection exception in peer worker. |
+    | ReplicateDataFailNonCriticalCauseCount      | The count of replicating PushData or PushMergedData failed caused by non-critical exception in peer worker. |
+    | ReplicateDataTimeoutCount                   | The count of replicating PushData or PushMergedData failed caused by push timeout in peer worker. |
+    | PushDataHandshakeFailCount                  | The count of PushDataHandshake failed in current worker. |
+    | RegionStartFailCount                        | The count of RegionStart failed in current worker. |
+    | RegionFinishFailCount                       | The count of RegionFinish failed in current worker. |
+    | PrimaryPushDataHandshakeTime                | PrimaryPushDataHandshake means handle PushData of primary partition location. |
+    | ReplicaPushDataHandshakeTime                | ReplicaPushDataHandshake means handle PushData of replica partition location. |
+    | PrimaryRegionStartTime                      | PrimaryRegionStart means handle RegionStart of primary partition location. |
+    | ReplicaRegionStartTime                      | ReplicaRegionStart means handle RegionStart of replica partition location. |
+    | PrimaryRegionFinishTime                     | PrimaryRegionFinish means handle RegionFinish of primary partition location. |
+    | ReplicaRegionFinishTime                     | ReplicaRegionFinish means handle RegionFinish of replica partition location. |
+    | PausePushDataTime                           | The time for a worker to stop receiving pushData from clients because of back pressure. |
+    | PausePushDataAndReplicateTime               | The time for a worker to stop receiving pushData from clients and other workers because of back pressure. |
+    | PausePushData                               | The count for a worker to stop receiving pushData from clients because of back pressure. |
+    | PausePushDataAndReplicate                   | The count for a worker to stop receiving pushData from clients and other workers because of back pressure. |
+    | TakeBufferTime                              | The time for a worker to take out a buffer from a disk flusher. |
+    | FlushDataTime                               | The time for a worker to write a buffer which is 256KB by default to storage. |
+    | CommitFilesTime                             | The time for a worker to flush buffers and close files related to specified shuffle. |
+    | SlotsAllocated                              | Slots allocated in last hour. |
+    | ActiveSlotsCount                            | The number of slots currently being used in a worker. |
+    | ReserveSlotsTime                            | ReserveSlots means acquire a disk buffer and record partition location. |
+    | ActiveConnectionCount                       | The count of active network connection. |
+    | NettyMemory                                 | The total amount of off-heap memory used by celeborn worker. |
+    | SortTime                                    | The time for a worker to sort a shuffle file. |
+    | SortMemory                                  | The memory used by sorting shuffle files. |
+    | SortingFiles                                | The count of sorting shuffle files. |
+    | SortedFiles                                 | The count of sorted shuffle files. |
+    | SortedFileSize                              | The count of sorted shuffle files 's total size. |
+    | DiskBuffer                                  | The memory occupied by pushData and pushMergedData which should be written to disk. |
+    | BufferStreamReadBuffer                      | The memory used by credit stream read buffer. |
+    | ReadBufferDispatcherRequestsLength          | The queue size of read buffer allocation requests. |
+    | ReadBufferAllocatedCount                    | Allocated read buffer count. |
+    | ActiveCreditStreamCount                     | Active stream count for map partition reading streams. |
+    | ActiveMapPartitionCount                     | The count of active map partition reading streams. |
+    | CleanTaskQueueSize                          | The count of task for cleaning up expired shuffle keys. |
+    | CleanExpiredShuffleKeysTime                 | The time for a worker to clean up shuffle data of expired shuffle keys. |
+    | DeviceOSFreeBytes                           | The actual usable space of OS for device monitor. |
+    | DeviceOSTotalBytes                          | The total usable space of OS for device monitor. |
+    | DeviceCelebornFreeBytes                     | The actual usable space of Celeborn for device. |
+    | DeviceCelebornTotalBytes                    | The total space of Celeborn for device. |
+    | PotentialConsumeSpeed                       | The speed of potential consumption for congestion control. |
+    | UserProduceSpeed                            | The speed of user production for congestion control. |
+    | WorkerConsumeSpeed                          | The speed of worker consumption for congestion control. |
+    | IsDecommissioningWorker                     | 1 means worker decommissioning, 0 means not decommissioning. |
+    | push_server_usedHeapMemory                  | |
+    | push_server_usedDirectMemory                | |
+    | push_server_numAllocations                  | |
+    | push_server_numTinyAllocations              | |
+    | push_server_numSmallAllocations             | |
+    | push_server_numNormalAllocations            | |
+    | push_server_numHugeAllocations              | |
+    | push_server_numDeallocations                | |
+    | push_server_numTinyDeallocations            | |
+    | push_server_numSmallDeallocations           | |
+    | push_server_numNormalDeallocations          | |
+    | push_server_numHugeDeallocations            | |
+    | push_server_numActiveAllocations            | |
+    | push_server_numActiveTinyAllocations        | |
+    | push_server_numActiveSmallAllocations       | |
+    | push_server_numActiveNormalAllocations      | |
+    | push_server_numActiveHugeAllocations        | |
+    | push_server_numActiveBytes                  | |
+    | replicate_server_usedHeapMemory             | |
+    | replicate_server_usedDirectMemory           | |
+    | replicate_server_numAllocations             | |
+    | replicate_server_numTinyAllocations         | |
+    | replicate_server_numSmallAllocations        | |
+    | replicate_server_numNormalAllocations       | |
+    | replicate_server_numHugeAllocations         | |
+    | replicate_server_numDeallocations           | |
+    | replicate_server_numTinyDeallocations       | |
+    | replicate_server_numSmallDeallocations      | |
+    | replicate_server_numNormalDeallocations     | |
+    | replicate_server_numHugeDeallocations       | |
+    | replicate_server_numActiveAllocations       | |
+    | replicate_server_numActiveTinyAllocations   | |
+    | replicate_server_numActiveSmallAllocations  | |
+    | replicate_server_numActiveNormalAllocations | |
+    | replicate_server_numActiveHugeAllocations   | |
+    | replicate_server_numActiveBytes             | |
+    | fetch_server_usedHeapMemory                 | |
+    | fetch_server_usedDirectMemory               | |
+    | fetch_server_numAllocations                 | |
+    | fetch_server_numTinyAllocations             | |
+    | fetch_server_numSmallAllocations            | |
+    | fetch_server_numNormalAllocations           | |
+    | fetch_server_numHugeAllocations             | |
+    | fetch_server_numDeallocations               | |
+    | fetch_server_numTinyDeallocations           | |
+    | fetch_server_numSmallDeallocations          | |
+    | fetch_server_numNormalDeallocations         | |
+    | fetch_server_numHugeDeallocations           | |
+    | fetch_server_numActiveAllocations           | |
+    | fetch_server_numActiveTinyAllocations       | |
+    | fetch_server_numActiveSmallAllocations      | |
+    | fetch_server_numActiveNormalAllocations     | |
+    | fetch_server_numActiveHugeAllocations       | |
+    | fetch_server_numActiveBytes                 | |

Review Comment:
   TODO: complete the description for these items in another PR.
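
   A possible aid for that follow-up, offered only as a rough sketch: the helper below lists metric rows whose Description cell is still empty. `metrics_missing_description` is a hypothetical name and not part of the Celeborn repository, and it assumes the two-column `| Metric Name | Description |` layout used in this diff, with no `|` characters inside a cell.

```python
def metrics_missing_description(markdown_text):
    """Return metric names from two-column Markdown table rows with an empty description.

    Hypothetical helper for illustration; assumes cells never contain '|'.
    """
    missing = []
    for line in markdown_text.splitlines():
        row = line.strip()
        # Only consider lines that look like table rows: "| ... | ... |"
        if len(row) < 2 or not (row.startswith("|") and row.endswith("|")):
            continue
        cells = [cell.strip() for cell in row[1:-1].split("|")]
        if len(cells) != 2:
            continue
        name, description = cells
        # Skip the header row and the separator row made of dashes/colons.
        if name == "Metric Name" or set(name) <= {"-", ":"}:
            continue
        if not description:
            missing.append(name)
    return missing


if __name__ == "__main__":
    sample = "\n".join([
        "| Metric Name | Description |",
        "|-------------|-------------|",
        "| NettyMemory | The total amount of off-heap memory used by celeborn worker. |",
        "| push_server_usedHeapMemory | |",
    ])
    print(metrics_missing_description(sample))
```

   Running it against the text of `docs/monitoring.md` would give the list of rows still to be documented.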



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.