Michael Dürig created OAK-5468:
----------------------------------

             Summary: Ease TarMK Operations
                 Key: OAK-5468
                 URL: https://issues.apache.org/jira/browse/OAK-5468
             Project: Jackrabbit Oak
          Issue Type: Epic
          Components: segment-tar
            Reporter: Michael Dürig
             Fix For: 1.8


h2. Ease of TarMK Operations

This epic is all about simplifying the operational aspects of the TarMK. 
Broadly this can be broken down into the following three topics.

h3. Monitoring
* We need to improve monitoring for system load and health. It should be easy 
for operators to figure out which parts of the TarMK are within safe bounds and 
which are not.
* It should be easy to diagnose failures and pinpoint their root cause. It 
should be evident whether and how a failure can be fixed by the operator. 

h3. Management
* Management tasks should be easy to use, clear and safe. It should be evident 
how to achieve a certain task, what it means to execute it and what its 
parameters mean (discoverability). Executing a task should not cause harm to 
the system when the system is not in the right state (e.g. running restore 
concurrently with backup should be safe). 

h3. Tooling
* We need better tooling for diagnosing systems, for example analysis of file 
stores (what content, how much content, distribution over space and time, 
reachability, retention time, garbage, etc.), both online and offline (i.e. 
post mortem).


h2. Individual improvements

Below is a list of items to address, in no specific order. Let's extract them 
into individual issues linked to this epic as we start tackling them. 

h3. Monitoring
* Throughput (e.g. time to commit, time to save, etc.)
* Thrashing (onset thereof)
* SNFE, i.e. SegmentNotFoundException (transient vs. catastrophic)
* DSGC (data store garbage collection)
* FileStore (e.g. size on disk, #tar files, #segments, #nodes, #properties, 
etc.)
* Cold standby (progress, liveness, latency, etc.)
* ...

h3. Management
* Revisit backup/restore (OAK-5103, OAK-4866)
* Coordination of management operations (ability to run conditionally, prevent 
them from running concurrently, etc.)
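
The coordination item above could look roughly like the following sketch. It is not Oak code: {{ManagementCoordinator}} and {{tryRun}} are hypothetical names, and it only coordinates operations within a single JVM, whereas real tooling would probably also have to coordinate across processes. It assumes Java 9 or later for {{AtomicReference.compareAndExchange}}.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BooleanSupplier;

/** Hypothetical coordinator for management operations; not part of Oak. */
final class ManagementCoordinator {

    /** Name of the operation currently running, or null if none is. */
    private final AtomicReference<String> running = new AtomicReference<>();

    /**
     * Runs the task only if no other operation is running and the
     * precondition holds. Returns an empty Optional if the task ran,
     * otherwise a human readable reason why it did not.
     */
    Optional<String> tryRun(String name, BooleanSupplier precondition, Runnable task) {
        // Atomically claim the slot; a non-null witness is the current holder.
        String holder = running.compareAndExchange(null, name);
        if (holder != null) {
            return Optional.of("refused: " + holder + " is already running");
        }
        try {
            // Conditional execution: check the system state before acting.
            if (!precondition.getAsBoolean()) {
                return Optional.of("skipped: precondition for " + name + " not met");
            }
            task.run();
            return Optional.empty();
        } finally {
            // Always release the slot, even if the task throws.
            running.set(null);
        }
    }
}
```

With such a guard, a restore requested while a backup is in progress is refused with a message naming the backup instead of running concurrently with it.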

h3. Tooling
* Progress monitor for {{oak-run compact}}
* Crash recovery for {{oak-run compact}} (e.g. run cleanup only to remove 
garbage left by prior crash)
* Bring {{oak-run check}} up to date. Address scalability and performance 
issues. Include more useful statistics (e.g. node types, child node lists, 
content distribution, etc.)
* Changes over time
* Consolidation of various (unversioned) scripts, such as the 'node count 
script' and the 'node remove script', into oak-run.
* Allow connecting tools to a running instance.        
* Snapshotting support: restartable stats collection (snapshot at certain 
revision, diff to collect extras)
* "Friendly" output formats that can be easily used by other tools (e.g. Unix 
tools, Kibana, etc.)
* Proper usage of stdin and stdout
* Proper exit codes
* A current gap in tooling is around healing a repository plagued with SNFEs. 
We should bridge the gap between {{oak-run check}} and the 'oak console node 
count script' and provide options to plug the holes so that AEM is usable. One 
idea would be to complement rolling back the segment store to the last good 
revision with rolling it forward to a new, fixed good revision. The simplest 
way of fixing is to just replace unreadable items with empty ones (i.e. 
"plugging the holes"). From there one could diff this new fixed revision 
against the last good revision to assess the damage and see what else needs 
fixing (e.g. to regain consistency with respect to JCR). 
* Classification of tools into development/research/experimental ones and 
production (customer-facing) ones. The latter need a different level of 
support, maintenance, QE, documentation, etc. Possibly mark via documentation 
which is which. 
* Group commands from oak-run in namespaces. Assign a different namespace to 
each persistence implementation in Oak. Let every implementation parse its own 
commands. Move commands closer to their implementation and relieve oak-run from 
code bloat. See OAK-5437 for further details.
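
The items above on namespaces, tool-friendly output, stdout usage and exit codes could fit together roughly as in the following sketch. It is not the oak-run implementation and not the design from OAK-5437: {{ToolSketch}}, {{Command}}, the {{namespace:command}} syntax and the exit code values are all assumptions made for illustration.

```java
import java.io.PrintStream;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical command-line front end; none of these names exist in oak-run. */
final class ToolSketch {

    // Assumed exit code convention: 0 success, 1 failure, 2 usage error.
    static final int EXIT_OK = 0;
    static final int EXIT_FAILURE = 1;
    static final int EXIT_USAGE = 2;

    /** Results go to out, diagnostics go to err, the return value is the exit code. */
    interface Command {
        int run(String[] args, PrintStream out, PrintStream err);
    }

    private final Map<String, Command> commands = new LinkedHashMap<>();

    /** Each persistence implementation registers commands under its own namespace. */
    void register(String namespace, String name, Command command) {
        commands.put(namespace + ":" + name, command);
    }

    /** Looks up args[0] as "namespace:command" and passes the remaining args on. */
    int dispatch(String[] args, PrintStream out, PrintStream err) {
        if (args.length == 0 || !commands.containsKey(args[0])) {
            err.println("usage: <namespace>:<command> [args]; known: " + commands.keySet());
            return EXIT_USAGE;
        }
        try {
            String[] rest = Arrays.copyOfRange(args, 1, args.length);
            return commands.get(args[0]).run(rest, out, err);
        } catch (RuntimeException e) {
            // Keep stdout clean so that downstream tools never parse error text.
            err.println("error: " + e.getMessage());
            return EXIT_FAILURE;
        }
    }
}
```

A {{main}} method would register the commands and finish with {{System.exit(tool.dispatch(args, System.out, System.err))}}. Commands that print one tab-separated record per line to {{out}} can be piped directly into Unix tools, while the exit code tells scripts whether the output can be trusted.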





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
