Ethan Rose created HDDS-10110:
---------------------------------

             Summary: Use RocksDB key count estimates instead of OM metrics file
                 Key: HDDS-10110
                 URL: https://issues.apache.org/jira/browse/HDDS-10110
             Project: Apache Ozone
          Issue Type: Sub-task
          Components: OM
            Reporter: Ethan Rose
            Assignee: Ethan Rose


HDDS-816 added a JSON file in the OM to persist metrics such as key count. 
That Jira has an attached doc that compares several options and concludes that 
periodically flushing to a JSON file is the best approach. However, it overlooks 
several issues with saving metrics this way:

* Error handling was missed. See HDDS-10094
* OMs' metrics can diverge if OMs are restarted at different times between 
flushes of the file.
* When a snapshot is installed on a follower, the metric is [reset to the 
estimated row count|https://github.com/apache/ozone/blob/14e7ff1e6fb2bf11f1df054c63b6e1729e328286/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java#L4006]
 anyway. That follower's metrics then diverge from those of the other OMs.
* When metrics for various OMs diverge, they will show different lines in 
dashboarding applications like Grafana, which may be confusing for users.
* Restoring the metric to a correct value after bugs like HDDS-10063 requires 
some sort of manual repair.
* Once metrics diverge between OMs, even a restart will not bring them back in 
sync.
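
To make the divergence concrete, here is a toy simulation, not OM code. The class and method names are hypothetical, and it assumes the simplest model: each OM applies the same operations, flushes its counter periodically, and reloads the last flushed value on restart.

```java
/** Runs two simulated OMs through the same operations with one mid-interval restart. */
class MetricDivergenceDemo {
  /** Returns {om1 count, om2 count} after identical replicated operations. */
  static long[] run() {
    FlushedMetricOm om1 = new FlushedMetricOm();
    FlushedMetricOm om2 = new FlushedMetricOm();

    applyToBoth(om1, om2, 10);   // 10 key creates on both OMs
    om1.flush();
    om2.flush();                 // both persist 10

    applyToBoth(om1, om2, 5);    // 5 more creates, not yet flushed
    om2.restart();               // om2 restarts between flushes and reloads 10

    applyToBoth(om1, om2, 3);    // 3 more creates on both
    om1.flush();
    om2.flush();                 // the diverged values are now persisted

    om1.restart();
    om2.restart();               // restarting both does not resync them
    return new long[] {om1.keyCount(), om2.keyCount()};
  }

  private static void applyToBoth(FlushedMetricOm a, FlushedMetricOm b, int n) {
    for (int i = 0; i < n; i++) {
      a.createKey();
      b.createKey();
    }
  }

  public static void main(String[] args) {
    long[] counts = run();
    System.out.println("om1=" + counts[0] + " om2=" + counts[1]);
  }
}

/** Toy model of one OM that persists a key count metric by periodic flush. */
class FlushedMetricOm {
  private long inMemory;   // live metric value
  private long persisted;  // last value written to the simulated metrics file

  void createKey() { inMemory++; }

  void flush() { persisted = inMemory; }

  // Increments made since the last flush are lost on restart.
  void restart() { inMemory = persisted; }

  long keyCount() { return inMemory; }
}
```

In this model the two OMs end at 18 and 13 even though they applied the same 18 key creates, and nothing short of a manual repair brings them back together.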

[HDDS-1829|https://issues.apache.org/jira/browse/HDDS-1829] later added the 
ability for some metrics to be updated based on RocksDB key count estimates. 
See {{Q: How to know the number of keys stored in a RocksDB database?}} in the 
[RocksDB FAQ|https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ]. These 
metrics survive a restart by using the key count estimate and do not use the 
metrics JSON file, so we now have two divergent implementations. However, once 
these metrics are updated on startup, they are not incremented as new OM 
operations come in.

This jira proposes:

# Get rid of the OM metrics json file.
# Use key count estimates for all metrics that must survive a restart.
# Continue to update these metrics as OM requests come in.

While the RocksDB estimated key count will not be totally accurate, the 
JSON-based approach will not be either. The RocksDB approach is easier to 
maintain, both in the amount of code required and in fixing metric counting bugs.
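
A minimal sketch of the three proposed steps, not the OM's actual code. {{EstimatedKeyCountMetric}} and its methods are hypothetical names, and the {{LongSupplier}} stands in for a RocksDB key count estimate (RocksDB exposes one through the {{rocksdb.estimate-num-keys}} property). Clamping at zero is an added assumption to cope with an estimate that is too low.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/** Hypothetical metric that is seeded from an estimate and then kept live in memory. */
class EstimatedKeyCountMetric {
  private final AtomicLong value = new AtomicLong();
  // Assumption: supplies the DB's estimated key count for one table.
  private final LongSupplier estimator;

  EstimatedKeyCountMetric(LongSupplier estimator) {
    this.estimator = estimator;
    resetFromEstimate();  // step 2: seed from the estimate, no metrics file involved
  }

  /** Called on startup and could also be called after a snapshot install. */
  void resetFromEstimate() {
    value.set(Math.max(0L, estimator.getAsLong()));
  }

  /** Step 3: keep counting as requests are applied. */
  void onKeyCreated() {
    value.incrementAndGet();
  }

  void onKeyDeleted() {
    // Never go negative if the starting estimate was too low.
    value.updateAndGet(v -> Math.max(0L, v - 1));
  }

  long get() {
    return value.get();
  }

  public static void main(String[] args) {
    EstimatedKeyCountMetric metric = new EstimatedKeyCountMetric(() -> 100L);
    metric.onKeyCreated();
    metric.onKeyCreated();
    metric.onKeyCreated();
    metric.onKeyDeleted();
    System.out.println(metric.get());
  }
}
```

Because every OM seeds the metric from its own DB rather than from a separately flushed file, a restart or snapshot install re-derives the value from the same data the OM serves, which is the property the JSON file approach lacks.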




