Michael Dürig created OAK-5468:
----------------------------------
Summary: Ease TarMK Operations
Key: OAK-5468
URL: https://issues.apache.org/jira/browse/OAK-5468
Project: Jackrabbit Oak
Issue Type: Epic
Components: segment-tar
Reporter: Michael Dürig
Fix For: 1.8
h2. Ease of TarMK Operations
This epic is all about simplifying the operational aspects of the TarMK.
Broadly this can be broken down into the following three topics.
h3. Monitoring
* We need to improve monitoring for system load and health. It should be easy
for operators to figure out which parts of the TarMK are within safe bounds and
which are not.
* Failures should be easy to diagnose, down to their root cause. It should be
evident whether and how a failure can be fixed by the operator.
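To make the "within safe bounds" idea concrete, here is a minimal sketch of classifying monitored values against operator-defined thresholds. Everything in it is an assumption made for illustration: the {{HealthBounds}} class, the {{Status}} levels and the metric names are hypothetical and not part of Oak.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: classify monitored values against operator-defined bounds. */
final class HealthBounds {
    enum Status { OK, WARN, CRITICAL }

    /** Upper thresholds for a single metric (assumed: higher values are worse). */
    private static final class Bound {
        final double warnAbove;
        final double criticalAbove;

        Bound(double warnAbove, double criticalAbove) {
            this.warnAbove = warnAbove;
            this.criticalAbove = criticalAbove;
        }

        Status classify(double value) {
            if (value > criticalAbove) return Status.CRITICAL;
            if (value > warnAbove) return Status.WARN;
            return Status.OK;
        }
    }

    private final Map<String, Bound> bounds = new LinkedHashMap<>();

    /** Registers the thresholds for a metric; the metric name is free-form. */
    HealthBounds bound(String metric, double warnAbove, double criticalAbove) {
        bounds.put(metric, new Bound(warnAbove, criticalAbove));
        return this;
    }

    /** Returns a status per reading; readings without a configured bound are skipped. */
    Map<String, Status> evaluate(Map<String, Double> readings) {
        Map<String, Status> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> reading : readings.entrySet()) {
            Bound bound = bounds.get(reading.getKey());
            if (bound != null) {
                result.put(reading.getKey(), bound.classify(reading.getValue()));
            }
        }
        return result;
    }
}
```

An operator-facing view could then list only the metrics whose status is not {{OK}}, which is one way to answer "which parts are out of bounds" at a glance.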
h3. Management
* Management tasks should be easy to use, clear and safe. It should be evident
how to achieve a certain task, what it means to execute it and what its
parameters mean (discoverability). Executing a task should not cause harm when
the system is not in the right state for it (e.g. running restore concurrently
with backup should be safe).
h3. Tooling
* We need better tooling for diagnosing systems, e.g. analysis of file stores
(what content, how much content, distribution over space and time,
reachability, retention time, garbage, etc.), both online and offline (i.e.
post mortem).
h2. Individual improvements
Below is a list of items to address, in no specific order. Let's extract them
into individual issues linked to this epic as we start tackling them.
h3. Monitoring
* Throughput (e.g. time to commit, time to save, etc.)
* Thrashing (onset thereof)
* SNFE (transient vs. catastrophic)
* DSGC
* FileStore (e.g. size on disk, #tar files, #segments, #nodes, #properties,
etc.)
* Cold standby (progress, liveliness, latency, etc.)
* ...
h3. Management
* Revisit backup/restore (OAK-5103, OAK-4866)
* Coordination of management operations (ability to run conditionally, prevent
them from running concurrently, etc.)
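The coordination item above (conditional execution plus mutual exclusion) could be sketched as follows. This is only an illustration under stated assumptions: {{OperationCoordinator}} and its {{Outcome}} values are hypothetical names, and the sketch coordinates operations within a single JVM only.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

/** Hypothetical sketch: run at most one management operation at a time. */
final class OperationCoordinator {
    enum Outcome { EXECUTED, SKIPPED_PRECONDITION, REJECTED_BUSY }

    /** Name of the operation currently running, or null if idle. */
    private final AtomicReference<String> running = new AtomicReference<>();

    /**
     * Runs the task only if no other operation is in progress and the
     * precondition holds. Never blocks: a busy coordinator rejects the call.
     */
    Outcome run(String name, BooleanSupplier precondition, Runnable task) {
        if (!running.compareAndSet(null, name)) {
            return Outcome.REJECTED_BUSY;
        }
        try {
            if (!precondition.getAsBoolean()) {
                return Outcome.SKIPPED_PRECONDITION;
            }
            task.run();
            return Outcome.EXECUTED;
        } finally {
            // Always release the slot, even if the task throws.
            running.set(null);
        }
    }

    /** Returns the name of the running operation, or null if idle. */
    String current() {
        return running.get();
    }
}
```

With such a guard, a restore requested while a backup is running would be rejected with an explicit outcome instead of interfering with it. Rejecting rather than queueing is a design choice of this sketch, not a requirement from this epic.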
h3. Tooling
* Progress monitor for {{oak-run compact}}
* Crash recovery for {{oak-run compact}} (e.g. run cleanup only to remove
garbage left by prior crash)
* Bring {{oak-run check}} up to date. Address scalability and performance
issues. Include more useful statistics (e.g. node types, child node lists,
content distribution, etc.)
* Changes over time
* Consolidation of various (unversioned) scripts into oak-run, like the 'node
count script' and the 'node remove script'.
* Allow connecting tools to a running instance.
* Snapshotting support: restartable stats collection (snapshot at certain
revision, diff to collect extras)
* "Friendly" output formats that can be easily used by other tools (e.g. Unix
tools, Kibana, etc.)
* Proper usage of stdin and stdout
* Proper exit codes
* A current gap in tooling is healing a repository plagued with SNFEs: bridge
the gap between {{oak-run check}} and the 'oak console node count script' and
provide options to plug the holes so that AEM is usable. One idea would be to
complement rolling back the segment store to the last good revision with
rolling it forward to a new, fixed good revision. The simplest way of fixing is
to just replace unreadable items with empty ones (i.e. "plugging the holes").
From there one could diff this new fixed revision against the last good
revision to assess the damage and see what else needs fixing (e.g. to regain
consistency with respect to JCR).
* Classification of tools into development / research / experimental and
production (customer facing). The latter need a different level of support,
maintenance, QE, documentation etc. Possibly mark via documentation which is
which.
* Group commands from oak-run in namespaces. Assign a different namespace to
each persistence implementation in Oak. Let every implementation parse its own
commands. Move commands closer to their implementation and relieve oak-run from
code bloat. See OAK-5437 for further details.
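Several of the tooling items above (namespaced commands, tool-friendly output, proper use of stdout and proper exit codes) can be illustrated together. The following is a rough sketch only: {{ToolRunner}}, the {{namespace:command}} syntax and the exit code values 0/1/2 are assumptions made for this example and do not reflect oak-run or the design in OAK-5437.

```java
import java.io.PrintStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: namespaced commands with explicit streams and exit codes. */
final class ToolRunner {
    static final int EXIT_OK = 0;      // command succeeded
    static final int EXIT_FAILURE = 1; // command ran but failed
    static final int EXIT_USAGE = 2;   // unknown command or bad invocation

    /** A command writes results to out, diagnostics to err, and returns an exit code. */
    interface Command {
        int execute(String[] args, PrintStream out, PrintStream err);
    }

    private final Map<String, Command> commands = new LinkedHashMap<>();

    /** Registers a command under "namespace:name" (the separator is an assumption). */
    ToolRunner register(String namespace, String name, Command command) {
        commands.put(namespace + ":" + name, command);
        return this;
    }

    /** Dispatches on args[0]; nothing but command results is ever written to out. */
    int run(String[] args, PrintStream out, PrintStream err) {
        if (args.length == 0 || !commands.containsKey(args[0])) {
            err.println("usage: <namespace>:<command> [args]; known: " + commands.keySet());
            return EXIT_USAGE;
        }
        String[] rest = Arrays.copyOfRange(args, 1, args.length);
        try {
            return commands.get(args[0]).execute(rest, out, err);
        } catch (RuntimeException e) {
            err.println("error: " + e.getMessage());
            return EXIT_FAILURE;
        }
    }
}
```

A {{main}} method would then end with {{System.exit(runner.run(args, System.out, System.err))}}. Keeping results on stdout (for instance as tab-separated lines) and diagnostics on stderr lets the output be piped into Unix tools without filtering.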