[
https://issues.apache.org/jira/browse/HDDS-1935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Elek, Marton updated HDDS-1935:
-------------------------------
Description:
Visibility is a key aspect of operating any Ozone cluster. We need
better visibility to improve correctness and performance. While distributed
tracing is a good tool for improving the visibility of performance, we have no
powerful tool which can be used to check the internal state of the Ozone
cluster and debug certain correctness issues.
To improve the visibility of the internal components, I propose to introduce a
new command-line application, `ozone insight`.
The new tool will show the selected metrics / logs / configuration for any of
the internal components (like replication-manager, pipeline, etc.).
For each insight point we can define the required logs and log levels, metrics
and configuration, and the tool can display only the component-specific
information during debugging.
h2. Usage
First we can check the available insight points:
{code}
bash-4.2$ ozone insight list
Available insight points:
  scm.node-manager                  SCM Datanode management related information.
  scm.replica-manager               SCM closed container replication manager
  scm.event-queue                   Information about the internal async event delivery
  scm.protocol.block-location       SCM Block location protocol endpoint
  scm.protocol.container-location   Planned insight point which is not yet implemented.
  scm.protocol.datanode             Planned insight point which is not yet implemented.
  scm.protocol.security             Planned insight point which is not yet implemented.
  scm.http                          Planned insight point which is not yet implemented.
  om.key-manager                    OM Key Manager
  om.protocol.client                Ozone Manager RPC endpoint
  om.http                           Planned insight point which is not yet implemented.
  datanode.pipeline[id]             More information about one ratis datanode ring.
  datanode.rocksdb                  More information about one ratis datanode ring.
  s3g.http                          Planned insight point which is not yet implemented.
{code}
Insight points can define configuration, metrics and/or logs. Configuration can
be displayed based on the configuration objects:
{code}
ozone insight config scm.protocol.block-location
Configuration for `scm.protocol.block-location` (SCM Block location protocol endpoint)
>>> ozone.scm.block.client.bind.host
default: 0.0.0.0
current: 0.0.0.0
The hostname or IP address used by the SCM block client endpoint to bind
>>> ozone.scm.block.client.port
default: 9863
current: 9863
The port number of the Ozone SCM block client service.
>>> ozone.scm.block.client.address
default: ${ozone.scm.client.address}
current: scm
The address of the Ozone SCM block client service. If not defined value of
ozone.scm.client.address is used
{code}
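All of this information (key, default value, description) is already available
in the Java configuration objects of hadoop-hdds. A minimal, hypothetical
sketch of such an object, using the existing `@ConfigGroup`/`@Config`
annotations (the class and method names here are illustrative, not the actual
Ozone classes):
{code}
import org.apache.hadoop.hdds.conf.Config;
import org.apache.hadoop.hdds.conf.ConfigGroup;
import org.apache.hadoop.hdds.conf.ConfigTag;

// Hypothetical configuration object for the SCM block client endpoint.
@ConfigGroup(prefix = "ozone.scm.block.client")
public class ScmBlockClientConfig {

  private String bindHost;

  @Config(key = "bind.host",
      defaultValue = "0.0.0.0",
      tags = {ConfigTag.OZONE, ConfigTag.SCM},
      description = "The hostname or IP address used by the SCM block"
          + " client endpoint to bind.")
  public void setBindHost(String bindHost) {
    this.bindHost = bindHost;
  }

  public String getBindHost() {
    return bindHost;
  }
}
{code}
The insight tool can read the defaults and descriptions from the annotations
and show them next to the current runtime values.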
Metrics can be retrieved from the Prometheus endpoint:
{code}
ozone insight metrics scm.protocol.block-location
Metrics for `scm.protocol.block-location` (SCM Block location protocol endpoint)
RPC connections
Open connections: 0
Dropped connections: 0
Received bytes: 0
Sent bytes: 0
RPC queue
RPC average queue time: 0.0
RPC call queue length: 0
RPC performance
RPC processing time average: 0.0
Number of slow calls: 0
Message type counters
Number of AllocateScmBlock: 0
Number of DeleteScmKeyBlocks: 0
Number of GetScmInfo: 2
Number of SortDatanodes: 0
{code}
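Under the hood the metrics subcommand only needs to scrape the Prometheus
endpoint and filter for the metrics of the selected component. An illustrative
sketch (the base URL and metric prefix are assumptions, not the final
implementation):
{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class MetricsScraper {

  /** Prints every metric with the given prefix from a /prom endpoint. */
  public static void printMetrics(String baseUrl, String prefix)
      throws Exception {
    URL url = new URL(baseUrl + "/prom");
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // Skip the # HELP / # TYPE comment lines of the Prometheus format.
        if (!line.startsWith("#") && line.startsWith(prefix)) {
          System.out.println(line);
        }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    printMetrics("http://scm:9876", "rpc_");
  }
}
{code}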
Log levels can be adjusted with the existing logLevel servlet, and the log
lines can be collected / streamed via a simple logstream servlet:
{code}
ozone insight log scm.node-manager
[SCM] 2019-08-08 12:42:37,392
[DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager]
Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 12:43:37,392
[DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager]
Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 12:44:37,392
[DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager]
Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 12:45:37,393
[DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager]
Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 12:46:37,392
[DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager]
Processing node report from [datanode=ozone_datanode_1.ozone_default]
{code}
The verbose mode can display the raw messages as well:
{code}
[SCM] 2019-08-08 13:16:37,398
[DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager]
Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 13:16:37,400
[TRACE|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] HB is
received from [datanode=ozone_datanode_1.ozone_default]:
storageReport {
storageUuid: "DS-bffe6bee-1166-4502-acf5-57fc16c5aa98"
storageLocation: "/data/hdds"
capacity: 470282264576
scmUsed: 16384
remaining: 205695963136
storageType: DISK
failed: false
}
{code}
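On the client side, following a component's logs requires nothing more than
two HTTP calls: one to the standard hadoop-common `/logLevel` servlet to raise
the level of the selected loggers, and one to the proposed logstream servlet
to tail the output. A rough sketch (the `/logstream` path belongs to this
proposal and does not exist yet):
{code}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class LogFollower {

  public static void main(String[] args) throws Exception {
    String scm = "http://scm:9876"; // assumed SCM HTTP address
    String logger = "org.apache.hadoop.hdds.scm.node.SCMNodeManager";

    // 1. Raise the log level with the existing hadoop-common servlet.
    new URL(scm + "/logLevel?log="
        + URLEncoder.encode(logger, "UTF-8") + "&level=DEBUG")
        .openStream().close();

    // 2. Tail the proposed (not yet existing) logstream servlet.
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        new URL(scm + "/logstream").openStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println("[SCM] " + line);
      }
    }
  }
}
{code}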
h2. Use cases
Ozone insight can be used for any kind of debugging. Some example problems
from yesterday:
1. Due to a cache problem, volumes were created twice, without any error the
second time. With this tool I can check the state of the internal cache, or
check whether the volume was added to RocksDB itself.
2. After fixing this problem we found a DNS caching issue. The OM responded
with an error, but it was not clear where the error was propagated from (it was
created in OzoneManagerProtocolClientSideTranslatorPB.handleError). By
checking the traffic between SCM and OM it is easy to track the origin of a
specific error.
3. After fixing this problem we found a pipeline problem (reported later as
HDDS-1933). With this tool I could check the content of the reports and
messages sent to the pipeline manager.
h2. Implementation
We can implement the tool without any significant code change, as it builds on
existing features:
* Metrics can be downloaded from the `/prom` endpoint
* Log levels can be set with the existing `/logLevel` servlet endpoint (from
hadoop-common)
* Log lines can be streamed with a very simple new servlet (a possible sketch
follows this list)
* Configuration can be displayed based on the configuration objects
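One possible shape for the new log-streaming servlet, just to make the idea
concrete: a temporary log4j appender which copies every formatted event to the
open HTTP response (all class and path names are illustrative):
{code}
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.Logger;
import org.apache.log4j.spi.LoggingEvent;

/** Illustrative sketch of a log-streaming servlet. */
public class LogStreamServlet extends HttpServlet {

  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    resp.setContentType("text/plain");
    PrintWriter writer = resp.getWriter();

    // Temporary appender which copies every log event to the response.
    AppenderSkeleton streamAppender = new AppenderSkeleton() {
      @Override
      protected void append(LoggingEvent event) {
        writer.println(event.getRenderedMessage());
        writer.flush();
      }

      @Override
      public void close() {
      }

      @Override
      public boolean requiresLayout() {
        return false;
      }
    };

    Logger root = Logger.getRootLogger();
    root.addAppender(streamAppender);
    try {
      // Keep the connection open; the appender streams the lines.
      Thread.sleep(Long.MAX_VALUE);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    } finally {
      root.removeAppender(streamAppender);
    }
  }
}
{code}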
A new `InsightPoint` interface can be introduced, where all the affected
loggers/levels, metrics and config classes can be defined for each component.
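A minimal sketch of how such an interface could look (the method names are
illustrative, not a final design):
{code}
import java.util.List;
import java.util.Map;

/** Illustrative sketch of the proposed InsightPoint interface. */
public interface InsightPoint {

  /** Human-readable description, shown by `ozone insight list`. */
  String getDescription();

  /** Loggers to adjust and the level to set (e.g. SCMNodeManager -> DEBUG). */
  Map<String, String> getRelatedLoggers();

  /** Prefixes of the Prometheus metrics relevant for this component. */
  List<String> getMetricPrefixes();

  /** Configuration classes whose values should be displayed. */
  List<Class<?>> getConfigurationClasses();
}
{code}
Each insight point (scm.node-manager, om.key-manager, ...) would be one
implementation of this interface, and the list/config/metrics/log subcommands
could work purely from these definitions.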
The Prometheus servlet endpoint can also be changed to be enabled by default.
> Improve the visibility with Ozone Insight tool
> ----------------------------------------------
>
> Key: HDDS-1935
> URL: https://issues.apache.org/jira/browse/HDDS-1935
> Project: Hadoop Distributed Data Store
> Issue Type: New Feature
> Reporter: Elek, Marton
> Assignee: Elek, Marton
> Priority: Major
>