[
https://issues.apache.org/jira/browse/OAK-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Davide Giannella updated OAK-5468:
----------------------------------
Fix Version/s: (was: 1.8.0)
> Ease TarMK Operations
> ---------------------
>
> Key: OAK-5468
> URL: https://issues.apache.org/jira/browse/OAK-5468
> Project: Jackrabbit Oak
> Issue Type: Epic
> Components: segment-tar
> Reporter: Michael Dürig
> Assignee: Michael Dürig
> Labels: management, monitoring, operations, tooling
> Fix For: 1.9.0
>
>
> h2. Ease of TarMK Operations
> This epic is all about simplifying the operational aspects of the TarMK.
> Broadly this can be broken down into the following three topics.
> h3. Monitoring
> * We need to improve monitoring for system load and health. It should be easy
> for operators to figure out which parts of the TarMK are within safe bounds
> and which are not.
> * Failures should be easy to diagnose, and their root cause easy to pinpoint.
> It should be evident if and how a failure can be fixed by the operator.
> h3. Management
> * Management tasks should be easy to use, clear and safe. It should be
> evident how to achieve a certain task, what it means to execute it and what
> its parameters mean (discoverability). Executing a task should not cause harm
> to the system because the system is not in the right state (e.g. running
> restore concurrently to backup should be safe).
> h3. Tooling
> * We need better tooling for diagnosing systems, e.g. analysis of file stores
> (what content, how much content, distribution over space and time,
> reachability, retention time, garbage, etc.), both online and offline (i.e.
> post mortem).
> h2. Individual improvements
> Below is a list of items to address in no specific order. Let's start
> extracting them into individual issues linked to this epic as we start
> tackling this.
> h3. Monitoring
> * Throughput (e.g. time to commit, time to save, etc.)
> * Thrashing (onset thereof)
> * SNFE (transient vs. catastrophic)
> * DSGC
> * FileStore (e.g. size on disk, #tar files, #segments, #nodes, #properties,
> etc.)
> * Cold standby (progress, liveliness, latency, etc.)
> * ...
> h3. Management
> * Revisit backup/restore (OAK-5103, OAK-4866)
> * Coordination of management operations (ability to run conditionally,
> prevent them from running concurrently, etc.)
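
One way to read "prevent them from running concurrently" is a store-wide lock that every management operation must take before it starts. The following is a minimal sketch of that idea only, not how Oak or oak-run coordinates operations: the lock file name `management.lock`, the `exclusive_operation` helper and the `OperationInProgress` exception are all hypothetical, and a real implementation would also have to deal with stale locks left behind by a crash.

```python
import os
import tempfile
from contextlib import contextmanager


class OperationInProgress(Exception):
    """Raised when another management operation already holds the lock."""


@contextmanager
def exclusive_operation(store_dir, name):
    """Hold a lock file in store_dir while one management operation runs."""
    # Hypothetical file name, chosen for this sketch only.
    lock_path = os.path.join(store_dir, "management.lock")
    try:
        # O_CREAT | O_EXCL makes creation atomic: it fails if the file exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        with open(lock_path) as f:
            holder = f.read().strip() or "unknown"
        raise OperationInProgress(
            f"{name} refused: {holder} is running") from None
    try:
        # Record who holds the lock so a refused caller can report it.
        with os.fdopen(fd, "w") as f:
            f.write(name)
        yield
    finally:
        os.remove(lock_path)


# Usage: a restore attempted while a backup holds the lock is refused.
store = tempfile.mkdtemp()
with exclusive_operation(store, "backup"):
    try:
        with exclusive_operation(store, "restore"):
            pass
    except OperationInProgress as e:
        print(e)
```

Refusing the second operation outright is the simplest policy. Running conditionally or queueing would need more state than a single lock file.
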
> h3. Tooling
> * Progress monitor for {{oak-run compact}}
> * Crash recovery for {{oak-run compact}} (e.g. run cleanup only to remove
> garbage left by prior crash)
> * Bring {{oak-run check}} up to date. Address scalability and performance
> issues. Include more useful statistics (e.g. node types, child node lists,
> content distribution, etc.)
> * Changes over time
> * Consolidation of various (unversioned) scripts, like the 'node count
> script' and the 'node remove script', into oak-run.
> * Allow connecting tools to a running instance.
> * Snapshotting support: restartable stats collection (snapshot at certain
> revision, diff to collect extras)
> * "Friendly" output formats that can be easily used by other tools (e.g. Unix
> tools, Kibana, etc.)
> * Proper usage of stdin and stdout
> * Proper exit codes
> * A current gap in tooling is around the idea of healing a repository plagued
> with SNFEs: bridge the gap between {{oak-run check}} and the 'oak console node
> count script', and provide options to plug the holes to restore the repository
> to a consistent state. One idea would be to complement rolling back the
> segment store to the last good revision with rolling it forward to a new and
> fixed good revision. The simplest way of fixing is to just replace
> unreadable items with empty ones (i.e. "plugging the holes"). From there one
> could diff this new fixed revision against the last good revision to assess
> the damage and see what else needs fixing (e.g. to regain consistency wrt.
> JCR).
> * Classification of tools between development / research / experimental and
> production (customer facing). The latter need a different level of support,
> maintenance, QE, documentation etc. Possibly mark via documentation which is
> which.
> * Group commands from oak-run in namespaces. Assign a different namespace to
> each persistence implementation in Oak. Let every implementation parse its
> own commands. Move commands closer to their implementation and relieve
> oak-run from code bloat. See OAK-5437 for further details.
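
The "plugging the holes" idea from the Tooling list can be sketched on a toy tree model. This is an illustration under stated assumptions, not Oak code: `Node`, `SegmentNotFound`, `plug_holes` and `diff` are hypothetical names, a child whose loader raises stands in for an item that hits an SNFE, and the diff only compares properties and child names.

```python
class SegmentNotFound(Exception):
    """Stand-in for an SNFE hit while reading a damaged item."""


class Node:
    """Toy node: properties plus children that are Nodes or loader callables."""

    def __init__(self, props=None, children=None):
        self.props = dict(props or {})
        self.children = dict(children or {})

    def child(self, name):
        c = self.children[name]
        # A callable child models a lazy read that may fail.
        return c() if callable(c) else c


def missing_segment():
    raise SegmentNotFound("segment not found")


def plug_holes(node, path=""):
    """Return (fixed copy, plugged paths); unreadable children become empty."""
    fixed = Node(node.props)
    plugged = []
    for name in sorted(node.children):
        child_path = f"{path}/{name}"
        try:
            child = node.child(name)
        except SegmentNotFound:
            fixed.children[name] = Node()  # plug the hole with an empty node
            plugged.append(child_path)
        else:
            fixed.children[name], below = plug_holes(child, child_path)
            plugged.extend(below)
    return fixed, plugged


def diff(good, fixed, path=""):
    """List paths where two fully readable trees differ."""
    changed = []
    if good.props != fixed.props:
        changed.append(path or "/")
    for name in sorted(set(good.children) | set(fixed.children)):
        child_path = f"{path}/{name}"
        if name not in good.children or name not in fixed.children:
            changed.append(child_path)
        else:
            changed.extend(
                diff(good.child(name), fixed.child(name), child_path))
    return changed


# Last good revision versus a damaged head revision where /b is unreadable.
good = Node({"title": "root"}, {
    "a": Node({"x": "1"}),
    "b": Node({"y": "2"}, {"c": Node({"z": "3"})}),
})
head = Node({"title": "root"}, {"a": Node({"x": "1"}), "b": missing_segment})

fixed, plugged = plug_holes(head)
print(plugged)            # ['/b']
print(diff(good, fixed))  # ['/b', '/b/c']
```

The plugged paths say where the repository was patched. The diff against the last good revision says what was lost, which is the input for any further repair.
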