[ 
https://issues.apache.org/jira/browse/OAK-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Dürig updated OAK-5468:
-------------------------------
    Fix Version/s:     (was: 1.10)
                       (was: 1.9.0)

> Ease TarMK Operations
> ---------------------
>
>                 Key: OAK-5468
>                 URL: https://issues.apache.org/jira/browse/OAK-5468
>             Project: Jackrabbit Oak
>          Issue Type: Epic
>          Components: segment-tar
>            Reporter: Michael Dürig
>            Assignee: Michael Dürig
>            Priority: Major
>              Labels: management, monitoring, operations, tooling
>
> h2. Ease of TarMK Operations
> This epic is all about simplifying the operational aspects of the TarMK. 
> Broadly this can be broken down into the following three topics.
> h3. Monitoring
> * We need to improve monitoring for system load and health. It should be easy 
> for operators to figure out which parts of the TarMK are within safe bounds 
> and which are not.
> * Failures should be easy to diagnose, and their root cause easy to pinpoint. 
> It should be evident if and how a failure can be fixed by the operator. 
> h3. Management
> * Management tasks should be easy to use, clear and safe. It should be 
> evident how to achieve a certain task, what it means to execute it and what 
> its parameters mean (discoverability). Executing a task should not cause harm 
> to the system because the system is not in the right state (e.g. running 
> restore concurrently to backup should be safe). 
> h3. Tooling
> * We need better tooling for diagnosing systems, e.g. analysis of file stores 
> (what content, how much content, distribution over space and time, 
> reachability, retention time, garbage, etc.), both online and offline (i.e. 
> post mortem).
> h2. Individual improvements
> Below is a list of items to address, in no specific order. Let's start 
> extracting them into individual issues linked to this epic as we start 
> tackling them. 
> h3. Monitoring
> * Throughput (e.g. time to commit, time to save, etc.)
> * Thrashing (onset thereof)
> * SNFE (transient vs. catastrophic)
> * DSGC
> * FileStore (e.g. size on disk, #tar files, #segments, #nodes, #properties, 
> etc.)
> * Cold standby (progress, liveliness, latency, etc.)
> * ...
> h3. Management
> * Revisit backup/restore (OAK-5103, OAK-4866)
> * Coordination of management operations (ability to run conditionally, 
> prevent them from running concurrently, etc.)
> h3. Tooling
> * Progress monitor for {{oak-run compact}}
> * Crash recovery for {{oak-run compact}} (e.g. run cleanup only to remove 
> garbage left by prior crash)
> * Bring {{oak-run check}} up to date. Address scalability and performance 
> issues. Include more useful statistics (e.g. node types, child node lists, 
> content distribution, etc.)
> * Changes over time
> * Consolidation of various (unversioned) scripts, like the 'node count 
> script' and the 'node remove script', into oak-run.
> * Allow connecting tools to a running instance.
> * Snapshotting support: restartable stats collection (snapshot at certain 
> revision, diff to collect extras)
> * "Friendly" output formats that can be easily used by other tools (e.g. Unix 
> tools, Kibana, etc.)
> * Proper usage of stdin and stdout
> * Proper exit codes
> * A current gap in tooling is around healing a repository plagued with 
> SNFEs: bridge the gap between {{oak-run check}} and the 'oak console node 
> count script', and provide options to plug the holes to restore the 
> repository to a consistent state. One idea would be to complement rolling 
> back the segment store to the last good revision with rolling it forward to a 
> new, fixed good revision. The simplest way of fixing is to just replace 
> unreadable items with empty ones (i.e. "plugging the holes"). From there one 
> could diff this new fixed revision against the last good revision to assess 
> the damage and see what else needs fixing (e.g. to regain consistency wrt. 
> JCR). 
> * Classification of tools between development / research / experimental and 
> production (customer facing). The latter need a different level of support, 
> maintenance, QE, documentation, etc. Possibly mark via documentation which is 
> which. 
> * Group commands from oak-run in namespaces. Assign a different namespace to 
> each persistence implementation in Oak. Let every implementation parse its 
> own commands. Move commands closer to their implementation and relieve 
> oak-run from code bloat. See OAK-5437 for further details.
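
The "plugging the holes" idea from the tooling list can be illustrated with a small, self-contained sketch. This is only a toy model in Python, not Oak code: the `Node` class, the `SegmentNotFound` exception, and the `plug_holes` and `diff` functions are all hypothetical names invented for the illustration, and a real tool would operate on segment store revisions rather than in-memory objects. The sketch rolls a damaged tree forward to a fixed one by replacing unreadable nodes with empty ones, then diffs the result against the last good revision to list the damaged paths.

```python
class SegmentNotFound(Exception):
    """Toy stand-in for an SNFE (hypothetical, not Oak's exception class)."""


class Node:
    """Toy node: properties plus named children; may be unreadable."""

    def __init__(self, props=None, children=None, readable=True):
        self._props = dict(props or {})
        self._children = dict(children or {})
        self._readable = readable

    def read(self):
        # Reading a damaged node fails, mimicking a missing segment.
        if not self._readable:
            raise SegmentNotFound()
        return self._props, self._children


def plug_holes(node):
    """Return a copy of the tree with unreadable nodes replaced by empty ones."""
    try:
        props, children = node.read()
    except SegmentNotFound:
        return Node()  # the "plug": an empty but readable node
    return Node(props, {name: plug_holes(c) for name, c in children.items()})


def diff(good, fixed, path="/"):
    """List paths that differ between two fully readable trees."""
    good_props, good_children = good.read()
    fixed_props, fixed_children = fixed.read()
    changes = []
    if good_props != fixed_props:
        changes.append(path)
    for name in sorted(set(good_children) | set(fixed_children)):
        child_path = path.rstrip("/") + "/" + name
        if name not in good_children or name not in fixed_children:
            changes.append(child_path)
        else:
            changes.extend(
                diff(good_children[name], fixed_children[name], child_path))
    return changes


# Last good revision: / -> a, b -> c
last_good = Node(children={
    "a": Node({"x": 1}),
    "b": Node({"y": 2}, {"c": Node({"z": 3})}),
})

# Damaged head revision: same content, but node b can no longer be read.
damaged = Node(children={
    "a": Node({"x": 1}),
    "b": Node(readable=False),
})

fixed = plug_holes(damaged)       # roll forward to a new, readable revision
damage = diff(last_good, fixed)   # assess what was lost
print(damage)
```

In this model the diff reports `/b` (its properties were lost) and `/b/c` (the subtree below the plug is gone), which is the kind of damage report an operator could use to decide what else needs fixing.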



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
