[
https://issues.apache.org/jira/browse/HDDS-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Elek, Marton updated HDDS-1935:
-------------------------------
Status: Patch Available (was: Open)
> Improve the visibility with Ozone Insight tool
> ----------------------------------------------
>
> Key: HDDS-1935
> URL: https://issues.apache.org/jira/browse/HDDS-1935
> Project: Hadoop Distributed Data Store
> Issue Type: New Feature
> Reporter: Elek, Marton
> Assignee: Elek, Marton
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Visibility is a key aspect for the operation of any Ozone cluster. We need
> better visibility to improve correctness and performance. While
> distributed tracing is a good tool for improving the visibility of
> performance, we have no powerful tool which can be used to check the internal
> state of the Ozone cluster and debug certain correctness issues.
> To improve the visibility of the internal components I propose to introduce a
> new command line application `ozone insight`.
> The new tool will show the selected metrics / logs / configuration for any of
> the internal components (like replication-manager, pipeline, etc.).
> For each insight point we can define the required logs and log levels,
> metrics and configuration, so the tool can display only the
> component-specific information during debugging.
> h2. Usage
> First we can check the available insight points:
> {code}
> bash-4.2$ ozone insight list
> Available insight points:
> scm.node-manager SCM Datanode management related
> information.
> scm.replica-manager SCM closed container replication
> manager
> scm.event-queue Information about the internal async
> event delivery
> scm.protocol.block-location SCM Block location protocol endpoint
> scm.protocol.container-location Planned insight point which is not yet
> implemented.
> scm.protocol.datanode Planned insight point which is not yet
> implemented.
> scm.protocol.security Planned insight point which is not yet
> implemented.
> scm.http Planned insight point which is not yet
> implemented.
> om.key-manager OM Key Manager
> om.protocol.client Ozone Manager RPC endpoint
> om.http Planned insight point which is not yet
> implemented.
> datanode.pipeline[id] More information about one ratis
> datanode ring.
> datanode.rocksdb More information about one ratis
> datanode ring.
> s3g.http Planned insight point which is not yet
> implemented.
> {code}
> Insight points can define configuration, metrics and/or logs. Configuration
> can be displayed based on the configuration objects:
> {code}
> ozone insight config scm.protocol.block-location
> Configuration for `scm.protocol.block-location` (SCM Block location protocol
> endpoint)
> >>> ozone.scm.block.client.bind.host
> default: 0.0.0.0
> current: 0.0.0.0
> The hostname or IP address used by the SCM block client endpoint to bind
> >>> ozone.scm.block.client.port
> default: 9863
> current: 9863
> The port number of the Ozone SCM block client service.
> >>> ozone.scm.block.client.address
> default: ${ozone.scm.client.address}
> current: scm
> The address of the Ozone SCM block client service. If not defined value of
> ozone.scm.client.address is used
> {code}
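The `default: ${ozone.scm.client.address}` / `current: scm` pair above suggests that the displayed current value has had its `${...}` reference expanded. A minimal sketch of that kind of expansion follows, assuming plain dictionaries stand in for the configuration objects; `resolve`, `describe` and the depth bound are hypothetical names and choices for illustration, not part of the tool.

```python
import re
from typing import Dict, List

# Matches a ${property.name} reference inside a configuration value.
VAR_RE = re.compile(r"\$\{([^}]+)\}")


def resolve(value: str, props: Dict[str, str], max_depth: int = 20) -> str:
    """Expand ${name} references using props; unknown names are left as-is.

    max_depth is an arbitrary bound for this sketch so that cyclic
    references cannot loop forever.
    """
    for _ in range(max_depth):
        expanded = VAR_RE.sub(lambda m: props.get(m.group(1), m.group(0)), value)
        if expanded == value:
            break
        value = expanded
    return value


def describe(key: str, default: str, props: Dict[str, str]) -> List[str]:
    """Render one entry in the same shape as the listing above."""
    current = resolve(props.get(key, default), props)
    return [f">>> {key}", f"default: {default}", f"current: {current}"]
```

With `props = {"ozone.scm.client.address": "scm"}`, describing `ozone.scm.block.client.address` with the default `${ozone.scm.client.address}` gives a `current: scm` line.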
> Metrics can be retrieved from the Prometheus endpoint:
> {code}
> ozone insight metrics scm.protocol.block-location
> Metrics for `scm.protocol.block-location` (SCM Block location protocol
> endpoint)
> RPC connections
> Open connections: 0
> Dropped connections: 0
> Received bytes: 0
> Sent bytes: 0
> RPC queue
> RPC average queue time: 0.0
> RPC call queue length: 0
> RPC performance
> RPC processing time average: 0.0
> Number of slow calls: 0
> Message type counters
> Number of AllocateScmBlock: 0
> Number of DeleteScmKeyBlocks: 0
> Number of GetScmInfo: 2
> Number of SortDatanodes: 0
> {code}
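As a rough illustration of how such a listing could be built, the sketch below downloads the `/prom` endpoint and parses the Prometheus text format into a dictionary. The base URL, the helper names and the metric names in the usage note are assumptions made for the example; the real metric names exposed by SCM may differ.

```python
import re
from typing import Dict
from urllib.request import urlopen

# One sample line of the Prometheus text format:
# metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)")


def fetch_prom(base_url: str) -> str:
    """Download the raw metrics text; base_url is e.g. the SCM web address."""
    with urlopen(base_url.rstrip("/") + "/prom") as response:
        return response.read().decode("utf-8")


def parse_prometheus_text(body: str) -> Dict[str, float]:
    """Map 'name{labels}' to its value, skipping comments and blank lines."""
    samples: Dict[str, float] = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, labels, value = match.groups()
        try:
            samples[name + (labels or "")] = float(value)
        except ValueError:
            continue  # not a numeric sample, ignore it in this sketch
    return samples


def select(samples: Dict[str, float], prefix: str) -> Dict[str, float]:
    """Keep only the metrics whose name starts with prefix."""
    return {k: v for k, v in samples.items() if k.startswith(prefix)}
```

For example, `select(parse_prometheus_text(fetch_prom(url)), "rpc_")` would keep only the samples whose names start with `rpc_`, where both `url` and the `rpc_` prefix are placeholders.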
> Log levels can be adjusted with the existing logLevel servlet and can be
> collected / streamed via a simple logstream servlet:
> {code}
> ozone insight log scm.node-manager
> [SCM] 2019-08-08 12:42:37,392
> [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager]
> Processing node report from [datanode=ozone_datanode_1.ozone_default]
> [SCM] 2019-08-08 12:43:37,392
> [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager]
> Processing node report from [datanode=ozone_datanode_1.ozone_default]
> [SCM] 2019-08-08 12:44:37,392
> [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager]
> Processing node report from [datanode=ozone_datanode_1.ozone_default]
> [SCM] 2019-08-08 12:45:37,393
> [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager]
> Processing node report from [datanode=ozone_datanode_1.ozone_default]
> [SCM] 2019-08-08 12:46:37,392
> [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager]
> Processing node report from [datanode=ozone_datanode_1.ozone_default]
> {code}
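The log level change relies on the `/logLevel` servlet from hadoop-common, which takes the logger name and the level as `log` and `level` query parameters. A small sketch of building that request is shown below; the function names are hypothetical and the host and port in the usage note are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def log_level_url(base_url: str, logger: str, level: str) -> str:
    """Build the /logLevel URL which sets `logger` to `level`."""
    query = urlencode({"log": logger, "level": level})
    return f"{base_url.rstrip('/')}/logLevel?{query}"


def set_log_level(base_url: str, logger: str, level: str) -> int:
    """Send the request and return the HTTP status code."""
    with urlopen(log_level_url(base_url, logger, level)) as response:
        return response.status
```

For instance, `log_level_url("http://scm:9876", "org.apache.hadoop.hdds.scm.node.SCMNodeManager", "DEBUG")` produces the URL that would raise the node manager logger to DEBUG before the log lines are streamed.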
> The verbose mode can display the raw messages as well:
> {code}
> [SCM] 2019-08-08 13:16:37,398
> [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager]
> Processing node report from [datanode=ozone_datanode_1.ozone_default]
> [SCM] 2019-08-08 13:16:37,400
> [TRACE|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] HB is
> received from [datanode=ozone_datanode_1.ozone_default]:
> storageReport {
> storageUuid: "DS-bffe6bee-1166-4502-acf5-57fc16c5aa98"
> storageLocation: "/data/hdds"
> capacity: 470282264576
> scmUsed: 16384
> remaining: 205695963136
> storageType: DISK
> failed: false
> }
> {code}
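When post-processing the streamed output, it can help to split each record into its fields and to attach multi-line payloads (such as the `storageReport` dump above) to the record they belong to. The following is a sketch under the assumption that each record header arrives on a single line in the form `[COMPONENT] timestamp [LEVEL|logger|name] message`, which is how the samples read once the mail line wrapping is undone; all function and field names are made up for the example.

```python
import re
from typing import Dict, Iterable, List, Optional

# Header of one streamed record, e.g.
# [SCM] 2019-08-08 13:16:37,398 [DEBUG|org...SCMNodeManager|SCMNodeManager] text
HEADER_RE = re.compile(
    r"^\[(?P<component>[^\]]+)\]\s+"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"\[(?P<level>[A-Z]+)\|(?P<logger>[^|\]]+)\|(?P<short_name>[^\]]+)\]\s*"
    r"(?P<message>.*)$"
)


def parse_header(line: str) -> Optional[Dict[str, object]]:
    """Return the header fields of a record line, or None for other lines."""
    match = HEADER_RE.match(line)
    return dict(match.groupdict()) if match else None


def group_records(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Attach continuation lines to the record whose header precedes them."""
    records: List[Dict[str, object]] = []
    for line in lines:
        header = parse_header(line)
        if header is not None:
            header["body"] = []
            records.append(header)
        elif records:
            records[-1]["body"].append(line)
    return records
```

Filtering the result on the `level` field (for example dropping `TRACE` records) is then a one-line list comprehension.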
> h2. Use cases
> Ozone insight can be used for any kind of debugging. Some example problems
> from yesterday:
> 1. Due to a cache problem the volumes were created twice without any error
> the second time. With this tool I can check the state of the internal
> cache, or check if the volume is added to the rocksdb itself.
> 2. After fixing this problem we found a DNS caching issue. The OM responded
> with an error but it was not clear where the error was propagated from (it
> was created in OzoneManagerProtocolClientSideTranslatorPB.handleError). By
> checking the traffic between SCM and OM it can be easy to track the origin of
> a specific error.
>
> 4. After fixing this problem we found a pipeline problem (reported later
> at HDDS-1933). With this tool I could check the content of the reports and
> messages to the pipeline manager.
>
> h2. Implementation
> We can implement the tool without any significant code change as it uses
> existing features:
> * Metrics can be downloaded from the `/prom` endpoint
> * Log Level can be set with the existing `/logLevel` servlet endpoint (from
> hadoop-common)
> * Log lines can be streamed with a very simple new servlet
> * Configuration can be displayed based on configuration points
> A new interface can be introduced for `InsightPoint`s where all the affected
> logs/levels, metrics and config classes can be defined for each component.
> The Prometheus servlet endpoint can be changed to be turned on by default.
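To make the `InsightPoint` idea more concrete, here is a minimal sketch of how one insight point could declare its loggers, metrics and configuration keys, and how the `list` subcommand could be driven from a registry of such declarations. It is written in Python purely for illustration (the tool itself would live in the Java code base), and the field names, `LoggerSource`, `REGISTRY` and `list_points` are all assumptions, not the proposed interface.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class LoggerSource:
    """A logger to enable for an insight point, and the level to set."""
    component: str          # e.g. "SCM"
    logger_name: str
    level: str = "DEBUG"


@dataclass(frozen=True)
class InsightPoint:
    """Everything the tool needs to know about one component."""
    name: str
    description: str
    loggers: Tuple[LoggerSource, ...] = ()
    metric_prefixes: Tuple[str, ...] = ()   # hypothetical: filter for /prom output
    config_keys: Tuple[str, ...] = ()


NODE_MANAGER = InsightPoint(
    name="scm.node-manager",
    description="SCM Datanode management related information.",
    loggers=(
        LoggerSource("SCM", "org.apache.hadoop.hdds.scm.node.SCMNodeManager"),
    ),
)

BLOCK_LOCATION = InsightPoint(
    name="scm.protocol.block-location",
    description="SCM Block location protocol endpoint",
    config_keys=(
        "ozone.scm.block.client.bind.host",
        "ozone.scm.block.client.port",
        "ozone.scm.block.client.address",
    ),
)

REGISTRY: Dict[str, InsightPoint] = {
    point.name: point for point in (NODE_MANAGER, BLOCK_LOCATION)
}


def list_points(registry: Dict[str, InsightPoint]) -> List[str]:
    """Render 'name  description' rows with the names padded to one width."""
    width = max((len(name) for name in registry), default=0) + 2
    return [f"{name:<{width}}{registry[name].description}" for name in registry]
```

Keeping the declaration as plain data means the `config`, `metrics` and `log` subcommands can each read only the field they need from the selected insight point.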