[incubator-celeborn] branch main updated: [CELEBORN-1085] Update metrics doc

zhouky Tue, 24 Oct 2023 06:43:08 -0700

This is an automated email from the ASF dual-hosted git repository.

zhouky pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/incubator-celeborn.git



The following commit(s) were added to refs/heads/main by this push:
     new e6a71f564 [CELEBORN-1085] Update metrics doc
e6a71f564 is described below

commit e6a71f56407cbc472037b4b4080f6703adb3ea16
Author: onebox-li <[email protected]>
AuthorDate: Tue Oct 24 21:42:51 2023 +0800

    [CELEBORN-1085] Update metrics doc
    
    ### What changes were proposed in this pull request?
    Update metrics doc.
    
    ### Why are the changes needed?
    Ditto
    
    ### Does this PR introduce _any_ user-facing change?
    Doc updated.
    
    ### How was this patch tested?
    No.
    
    Closes #2035 from onebox-li/update-metrics-doc.
    
    Authored-by: onebox-li <[email protected]>
    Signed-off-by: zky.zhoukeyong <[email protected]>
---
 METRICS.md | 94 ++++++++++++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 73 insertions(+), 21 deletions(-)

diff --git a/METRICS.md b/METRICS.md
index 9db43e1c7..61d7313ae 100644
--- a/METRICS.md
+++ b/METRICS.md
@@ -5,13 +5,12 @@ issue or monitor Celeborn cluster.
 
 ## Prerequisites
 
-1.Enable Celeborn metrics.
-set celeborn.metrics.enabled = true  
-2.You need to install prometheus(https://prometheus.io/)  
-We provide an example for prometheus config file
+1.Enable Celeborn metrics. Set configuration `celeborn.metrics.enabled` to true (true by default).
+
+2.Install Prometheus (https://prometheus.io/). We provide an example for Prometheus config file:
 
 ```yaml
-# prometheus example config
+# Prometheus example config
 global:
   scrape_interval: 15s
   evaluation_interval: 15s
@@ -24,15 +23,17 @@ scrape_configs:
       - targets: [ "master-ip:9098","worker1-ip:9096","worker2-ip:9096","worker3-ip:9096","worker4-ip:9096" ]
 ```
 
-3.You need to install Grafana server(https://grafana.com/grafana/download)
+3.Install Grafana server (https://grafana.com/grafana/download).
+
+4.Import Celeborn dashboard into Grafana.
 
-4.Import Celeborn dashboard into grafana.
-You can find Celeborn dashboard at assets/grafana/celeborn-dashboard.json.
+You can find the Celeborn dashboard templates under the `assets/grafana` directory.
+`celeborn-dashboard.json` displays Celeborn internal metrics and `celeborn-jvm-dashboard.json` displays Celeborn JVM related metrics.
 
 ### Optional
 
 We recommend you to install node exporter (https://github.com/prometheus/node_exporter)
-on every host, and configure prometheus to scrape information about the host.
+on every host, and configure Prometheus to scrape information about the host.
 Grafana will need a dashboard (dashboard id:8919) to display host details.
 
 ```yaml
@@ -51,17 +52,19 @@ scrape_configs:
       - targets: [ "master-ip:9100","worker1-ip:9100","worker2-ip:9100","worker3-ip:9100","worker4-ip:9100" ]
 ```
 
-Here is an example of grafana dashboard importing.
+### Import Dashboard Steps
+Here is an example of Grafana dashboard importing.
+
 ![g1](assets/img/g1.png)
 ![g2](assets/img/g2.png)
 ![g3](assets/img/g3.png)
-![g4](assets/img/g4.png)
-![g6](assets/img/g6.png)
-![g5](assets/img/g5.png)
+<img alt="g4" height="90%" src="assets/img/g4.png" width="90%"/>
+<img alt="g6" height="90%" src="assets/img/g6.png" width="90%"/>
+<img alt="g5" height="90%" src="assets/img/g5.png" width="90%"/>
 
 ## Details
 
-|               MetricName               |       Role        |                                                   Description                                                   |
+|               MetricName               |       Scope       |                                                   Description                                                   |
 |:--------------------------------------:|:-----------------:|:---------------------------------------------------------------------------------------------------------------:|
 |              WorkerCount               |      master       |                                          The count of active workers.                                           |
 |          ExcludedWorkerCount           |      master       |                                     The count of workers in excluded list.                                      |
@@ -70,19 +73,23 @@ Here is an example of grafana dashboard importing.
 |             PartitionSize              |      master       |          The estimated partition size of last 20 flush window whose length is 15 seconds by defaults.           |
 |            PartitionWritten            |      master       |                                            The active shuffle size.                                             |
 |           PartitionFileCount           |      master       |                                       The active shuffle partition count.                                       |
+|             diskFileCount              |      master       |                                The count of disk files consumption by each user.                                |
+|            diskBytesWritten            |      master       |                               The amount of disk files consumption by each user.                                |
+|             hdfsFileCount              |      master       |                                The count of hdfs files consumption by each user.                                |
+|            hdfsBytesWritten            |      master       |                               The amount of hdfs files consumption by each user.                                |
 |         RegisteredShuffleCount         | master and worker |                                  The value means count of registered shuffle.                                   |
 |            CommitFilesTime             |      worker       |                           CommitFiles means flush and close a shuffle partition file.                           |
 |            ReserveSlotsTime            |      worker       |                     ReserveSlots means acquire a disk buffer and record partition location.                     |
 |             FlushDataTime              |      worker       |                                  FlushData means flush a disk buffer to disk.                                   |
 |             OpenStreamTime             |      worker       |            OpenStream means read a shuffle file and send client about chunks size and stream index.             |
 |             FetchChunkTime             |      worker       |                      FetchChunk means read a chunk from a shuffle file and send to client.                      |
-|           PrimaryPushDataTime          |      worker       |                       PrimaryPushData means handle pushdata of primary partition location.                      |
-|           ReplicaPushDataTime          |      worker       |                        ReplicaPushData means handle pushdata of replica partition location.                     |
+|          PrimaryPushDataTime           |      worker       |                      PrimaryPushData means handle pushdata of primary partition location.                       |
+|          ReplicaPushDataTime           |      worker       |                      ReplicaPushData means handle pushdata of replica partition location.                       |
 |           WriteDataFailCount           |      worker       |                    The count of writing PushData or PushMergedData failed in current worker.                    |
 |         ReplicateDataFailCount         |      worker       |                  The count of replicating PushData or PushMergedData failed in current worker.                  |
 |      ReplicateDataWriteFailCount       |      worker       |       The count of replicating PushData or PushMergedData failed caused by write failure in peer worker.        |
 | ReplicateDataCreateConnectionFailCount |      worker       | The count of replicating PushData or PushMergedData failed caused by creating connection failed in peer worker. |
-| ReplicateDataConnectionExceptionCount  |      worker       |    The count of replicating PushData or PushMergedData failed caused by connection exception in peer worker.    | 
+| ReplicateDataConnectionExceptionCount  |      worker       |    The count of replicating PushData or PushMergedData failed caused by connection exception in peer worker.    |
 |       ReplicateDataTimeoutCount        |      worker       |        The count of replicating PushData or PushMergedData failed caused by push timeout in peer worker.        |
 |             TakeBufferTime             |      worker       |                              TakeBuffer means get a disk buffer from disk flusher.                              |
 |             SlotsAllocated             |      worker       |                                          Slots allocated in last hour                                           |
@@ -97,15 +104,60 @@ Here is an example of grafana dashboard importing.
 |       PausePushDataAndReplicate        |      worker       |    PausePushDataAndReplicate means the count of worker stopped receiving data from client and other workers.    |
 |           ActiveShuffleSize            |      worker       |                 The active shuffle size of a worker including master replica and slave replica.                 |
 |         ActiveShuffleFileCount         |      worker       |              The active shuffle file count of a worker including master replica and slave replica.              |
+|              jvm_gc_count              |        JVM        |                                     The GC count of each garbage collector.                                     |
+|              jvm_gc_time               |        JVM        |                                   The GC cost time of each garbage collector.                                   |
+|          jvm_memory_heap_init          |        JVM        |                                         The amount of heap init memory.                                         |
+|          jvm_memory_heap_max           |        JVM        |                                         The amount of heap max memory.                                          |
+|          jvm_memory_heap_used          |        JVM        |                                         The amount of heap used memory.                                         |
+|       jvm_memory_heap_committed        |        JVM        |                                      The amount of heap committed memory.                                       |
+|         jvm_memory_heap_usage          |        JVM        |                                      The percentage of heap memory usage.                                       |
+|        jvm_memory_non_heap_init        |        JVM        |                                       The amount of non-heap init memory.                                       |
+|        jvm_memory_non_heap_max         |        JVM        |                                       The amount of non-heap max memory.                                        |
+|        jvm_memory_non_heap_used        |        JVM        |                                       The amount of non-heap used memory.                                       |
+|     jvm_memory_non_heap_committed      |        JVM        |                                    The amount of non-heap committed memory.                                     |
+|       jvm_memory_non_heap_usage        |        JVM        |                                    The percentage of non-heap memory usage.                                     |
+|         jvm_memory_pools_init          |        JVM        |                                  The amount of each memory pool's init memory.                                  |
+|          jvm_memory_pools_max          |        JVM        |                                  The amount of each memory pool's max memory.                                   |
+|         jvm_memory_pools_used          |        JVM        |                                  The amount of each memory pool's used memory.                                  |
+|       jvm_memory_pools_committed       |        JVM        |                               The amount of each memory pool's committed memory.                                |
+|     jvm_memory_pools_used_after_gc     |        JVM        |                             The amount of each memory pool's used memory after GC.                              |
+|         jvm_memory_pools_usage         |        JVM        |                               The percentage of each memory pool's memory usage.                                |
+|         jvm_memory_total_init          |        JVM        |                                        The amount of total init memory.                                         |
+|          jvm_memory_total_max          |        JVM        |                                         The amount of total max memory.                                         |
+|         jvm_memory_total_used          |        JVM        |                                        The amount of total used memory.                                         |
+|       jvm_memory_total_committed       |        JVM        |                                      The amount of total committed memory.                                      |
+|          jvm_direct_capacity           |        JVM        |                          An estimate of the total capacity of the buffers in this pool                          |
+|            jvm_direct_count            |        JVM        |                                An estimate of the number of buffers in the pool                                 |
+|            jvm_direct_used             |        JVM        |                        An estimate of the memory that JVM is using for this buffer pool                         |
+|          jvm_mapped_capacity           |        JVM        |                          An estimate of the total capacity of the buffers in this pool                          |
+|            jvm_mapped_count            |        JVM        |                                An estimate of the number of buffers in the pool                                 |
+|            jvm_mapped_used             |        JVM        |                        An estimate of the memory that JVM is using for this buffer pool                         |
+|            jvm_thread_count            |        JVM        |                                         The current number of threads.                                          |
+|        jvm_thread_daemon_count         |        JVM        |                                      The current number of daemon threads.                                      |
+|        jvm_thread_blocked_count        |        JVM        |                               The current number of threads having blocked state.                               |
+|       jvm_thread_deadlock_count        |        JVM        |                              The current number of threads having deadlock state.                               |
+|          jvm_thread_new_count          |        JVM        |                                 The current number of threads having new state.                                 |
+|       jvm_thread_runnable_count        |        JVM        |                              The current number of threads having runnable state.                               |
+|      jvm_thread_terminated_count       |        JVM        |                             The current number of threads having terminated state.                              |
+|     jvm_thread_timed_waiting_count     |        JVM        |                            The current number of threads having timed_waiting state.                            |
+|        jvm_thread_waiting_count        |        JVM        |                               The current number of threads having waiting state.                               |
+|               JVMCPUTime               |      system       |                                             The JVM costs cpu time.                                             |
+|          AvailableProcessors           |      system       |                                   The amount of system available processors.                                    |
+|          LastMinuteSystemLoad          |      system       |                                         The last minute load of system.                                         |
 
 ## Implementation
 
-Celeborn master metric : `org/apache/celeborn/service/deploy/master/MasterSource.scala`
-Celeborn worker metric : `org/apache/celeborn/service/deploy/worker/WorkerSource.scala`
+Celeborn master metrics : `org/apache/celeborn/service/deploy/master/MasterSource.scala`.
+
+Celeborn worker metrics : `org/apache/celeborn/service/deploy/worker/WorkerSource.scala`.
 
-## Grafana Dashboard
+Other common metrics are implemented in `org.apache.celeborn.common.metrics.source` package.
+
+## Dashboard Snapshots
+
+The dashboard [Celeborn-dashboard](assets/grafana/celeborn-dashboard.json) was generated by Grafana of version 10.0.3.
 
-We provide a grafana dashboard for Celeborn [Grafana-Dashboard](assets/grafana/celeborn-dashboard.json). The dashboard was generated by grafana of version 9.4.1.
 Here are some snapshots:
+
 ![d1](assets/img/dashboard1.png)
 ![d2](assets/img/dashboard_full.webp)
