[ 
https://issues.apache.org/jira/browse/HDDS-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elek, Marton updated HDDS-1935:
-------------------------------
    Status: Patch Available  (was: Open)

> Improve the visibility with Ozone Insight tool
> ----------------------------------------------
>
>                 Key: HDDS-1935
>                 URL: https://issues.apache.org/jira/browse/HDDS-1935
>             Project: Hadoop Distributed Data Store
>          Issue Type: New Feature
>            Reporter: Elek, Marton
>            Assignee: Elek, Marton
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Visibility is a key aspect of operating any Ozone cluster. We need 
> better visibility to improve correctness and performance. While 
> distributed tracing is a good tool for improving the visibility of 
> performance, we have no powerful tool for checking the internal 
> state of the Ozone cluster and debugging certain correctness issues.
> To improve the visibility of the internal components I propose to introduce a 
> new command line application `ozone insight`.
> The new tool will show the selected metrics / logs / configuration for any of 
> the internal components (like replication-manager, pipeline, etc.).
> For each insight point we can define the required logs and log levels, 
> metrics and configuration, so the tool can display only the 
> component-specific information during debugging.
> h2. Usage
> First we can check the available insight points:
> {code}
> bash-4.2$ ozone insight list
> Available insight points:
>   scm.node-manager                     SCM Datanode management related information.
>   scm.replica-manager                  SCM closed container replication manager
>   scm.event-queue                      Information about the internal async event delivery
>   scm.protocol.block-location          SCM Block location protocol endpoint
>   scm.protocol.container-location      Planned insight point which is not yet implemented.
>   scm.protocol.datanode                Planned insight point which is not yet implemented.
>   scm.protocol.security                Planned insight point which is not yet implemented.
>   scm.http                             Planned insight point which is not yet implemented.
>   om.key-manager                       OM Key Manager
>   om.protocol.client                   Ozone Manager RPC endpoint
>   om.http                              Planned insight point which is not yet implemented.
>   datanode.pipeline[id]                More information about one ratis datanode ring.
>   datanode.rocksdb                     More information about one ratis datanode ring.
>   s3g.http                             Planned insight point which is not yet implemented.
> {code}
> Insight points can define configuration, metrics and/or logs. Configuration 
> can be displayed based on the configuration objects:
> {code}
> ozone insight config scm.protocol.block-location
> Configuration for `scm.protocol.block-location` (SCM Block location protocol endpoint)
> >>> ozone.scm.block.client.bind.host
>        default: 0.0.0.0
>        current: 0.0.0.0
> The hostname or IP address used by the SCM block client  endpoint to bind
> >>> ozone.scm.block.client.port
>        default: 9863
>        current: 9863
> The port number of the Ozone SCM block client service.
> >>> ozone.scm.block.client.address
>        default: ${ozone.scm.client.address}
>        current: scm
> The address of the Ozone SCM block client service. If not defined value of ozone.scm.client.address is used
> {code}
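The default of `ozone.scm.block.client.address` above is a `${...}` reference to another key. As a minimal sketch of how such a reference can be expanded (this is an illustration, not the code the tool or Hadoop actually uses; `resolve` and `props` are hypothetical names):

```python
import re

# Matches ${some.key} references inside a configuration value.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")

def resolve(value, props, max_passes=20):
    """Expand ${key} references using props; unknown keys are left untouched.

    Expansion is repeated so that a referenced value may itself contain
    references. The pass limit guards against cyclic definitions.
    """
    for _ in range(max_passes):
        expanded = PLACEHOLDER.sub(
            lambda m: props.get(m.group(1), m.group(0)), value
        )
        if expanded == value:
            return expanded
        value = expanded
    return value
```

With `props = {"ozone.scm.client.address": "scm"}`, the default `${ozone.scm.client.address}` expands to `scm`. Whether the `current` value in the listing was set explicitly or derived this way cannot be told from the output alone.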
> Metrics can be retrieved from the Prometheus endpoint:
> {code}
> ozone insight metrics scm.protocol.block-location
> Metrics for `scm.protocol.block-location` (SCM Block location protocol endpoint)
> RPC connections
>   Open connections: 0
>   Dropped connections: 0
>   Received bytes: 0
>   Sent bytes: 0
> RPC queue
>   RPC average queue time: 0.0
>   RPC call queue length: 0
> RPC performance
>   RPC processing time average: 0.0
>   Number of slow calls: 0
> Message type counters
>   Number of AllocateScmBlock: 0
>   Number of DeleteScmKeyBlocks: 0
>   Number of GetScmInfo: 2
>   Number of SortDatanodes: 0
> {code}
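A grouped view like the one above can be derived from the Prometheus text exposition format. The following is a rough sketch under stated assumptions: the metric names passed in by the caller are hypothetical, label values containing `}` are not handled, and samples that differ only in labels are simply summed.

```python
import re

# One sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{([^}]*)\})?\s+(\S+)")

def parse_prometheus_text(text):
    """Return {metric_name: value}, summing samples across label sets."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, _labels, raw = match.groups()
        totals[name] = totals.get(name, 0.0) + float(raw)
    return totals

def render_group(title, display_names, totals):
    """Render one titled group; display_names maps metric name -> label."""
    lines = [title]
    for metric, label in display_names.items():
        lines.append(f"  {label}: {totals.get(metric, 0.0):g}")
    return "\n".join(lines)
```

Each insight point would then only need to declare which metric names belong to which group title.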
> Log levels can be adjusted with the existing logLevel servlet and can be 
> collected / streamed via a simple logstream servlet:
> {code}
> ozone insight log scm.node-manager
> [SCM] 2019-08-08 12:42:37,392 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
> [SCM] 2019-08-08 12:43:37,392 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
> [SCM] 2019-08-08 12:44:37,392 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
> [SCM] 2019-08-08 12:45:37,393 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
> [SCM] 2019-08-08 12:46:37,392 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
> {code}
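Streamed lines in this layout are easy to post-process on the client side. Here is a small sketch that splits one record into its parts, assuming each record arrives on a single line; the field names (`component`, `short_name`, and so on) are my own labels, since the issue does not name them.

```python
import re
from datetime import datetime

# [COMPONENT] timestamp [LEVEL|logger|short name] message
LOG_RE = re.compile(
    r"^\[(?P<component>[^\]]+)\] "
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"\[(?P<level>[A-Z]+)\|(?P<logger>[^|]+)\|(?P<short_name>[^\]]+)\] "
    r"(?P<message>.*)$"
)

def parse_log_line(line):
    """Return a dict of fields, or None if the line is not a record header."""
    match = LOG_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%Y-%m-%d %H:%M:%S,%f"
    )
    return record

def filter_by_level(lines, level):
    """Keep only the parsed records logged at the given level."""
    records = (parse_log_line(line) for line in lines)
    return [r for r in records if r is not None and r["level"] == level]
```

Continuation lines of a multi-line message (such as a dumped protobuf body) do not match the pattern and come back as `None`, so a caller can attach them to the preceding record.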
> The verbose mode can display the raw messages as well:
> {code}
> [SCM] 2019-08-08 13:16:37,398 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
> [SCM] 2019-08-08 13:16:37,400 [TRACE|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] HB is received from [datanode=ozone_datanode_1.ozone_default]: 
> storageReport {
>   storageUuid: "DS-bffe6bee-1166-4502-acf5-57fc16c5aa98"
>   storageLocation: "/data/hdds"
>   capacity: 470282264576
>   scmUsed: 16384
>   remaining: 205695963136
>   storageType: DISK
>   failed: false
> }
> {code}
> h2. Use cases
> Ozone insight can be used for any kind of debugging. Some example problems 
> from yesterday:
>  1. Due to a cache problem, volumes were created twice without any error 
> the second time. With this tool I can check the state of the internal 
> cache, or check whether the volume was added to RocksDB itself.
>  2. After fixing this problem we found a DNS caching issue. The OM responded 
> with an error but it was not clear where the error was propagated from (it 
> was created in OzoneManagerProtocolClientSideTranslatorPB.handleError). By 
> checking the traffic between SCM and OM, it is easy to track the origin of 
> a specific error.
>  3. After fixing this problem we found a pipeline problem (reported later 
> at HDDS-1933). With this tool I could check the content of the reports and 
> messages sent to the pipeline manager.
>  
> h2. Implementation
> We can implement the tool without any significant code change as it uses 
> existing features:
>  * Metrics can be downloaded from the `/prom` endpoint
>  * Log Level can be set with the existing `/logLevel` servlet endpoint (from 
> hadoop-common)
>  * Log lines can be streamed with a very simple new servlet
>  * Configuration can be displayed based on configuration points
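The two existing endpoints in the list above are plain HTTP, so a client only has to build the right URLs. A sketch, assuming the hadoop-common logLevel servlet takes its logger and level as the `log` and `level` query parameters; the base URL used in the usage note is a placeholder for a component's HTTP address.

```python
from urllib.parse import urlencode

def log_level_url(base_url, logger, level):
    """URL that asks the logLevel servlet to set `logger` to `level`."""
    query = urlencode({"log": logger, "level": level})
    return f"{base_url.rstrip('/')}/logLevel?{query}"

def metrics_url(base_url):
    """URL of the Prometheus text endpoint of the same component."""
    return f"{base_url.rstrip('/')}/prom"
```

For example, `log_level_url("http://scm:9876", "org.apache.hadoop.hdds.scm.node.SCMNodeManager", "DEBUG")` yields a URL that can be fetched with any HTTP client before the log stream is opened.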
> A new interface can be introduced for `InsightPoint`s where all the affected 
> logs/levels, metrics and config classes can be defined for each component.
> The Prometheus servlet endpoint can be changed to be turned on by default.
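To make the `InsightPoint` idea concrete, here is a language-neutral sketch of the data such a definition might carry. The proposal only names the interface; every field name and the list formatting below are hypothetical, and the real implementation would presumably be a Java interface.

```python
from dataclasses import dataclass, field

@dataclass
class InsightPoint:
    """Hypothetical description of one insight point."""
    name: str                                           # e.g. "scm.node-manager"
    description: str
    loggers: dict = field(default_factory=dict)         # logger name -> level
    metric_groups: dict = field(default_factory=dict)   # group title -> metric names
    config_keys: list = field(default_factory=list)     # related configuration keys

def render_list(points):
    """Render the registered points in a two-column list layout."""
    lines = ["Available insight points:"]
    for point in points:
        lines.append(f"  {point.name:<36} {point.description}")
    return "\n".join(lines)

# Example definition built from names that appear in the issue text.
NODE_MANAGER = InsightPoint(
    name="scm.node-manager",
    description="SCM Datanode management related information.",
    loggers={"org.apache.hadoop.hdds.scm.node.SCMNodeManager": "DEBUG"},
)
```

Keeping loggers, metrics and configuration keys together in one definition is what lets the `list`, `log`, `metrics` and `config` subcommands share a single registry.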



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
