[ https://issues.apache.org/jira/browse/OAK-5468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Dürig updated OAK-5468:
-------------------------------
    Fix Version/s: (was: 1.10)
                   (was: 1.9.0)

> Ease TarMK Operations
> ---------------------
>
> Key: OAK-5468
> URL: https://issues.apache.org/jira/browse/OAK-5468
> Project: Jackrabbit Oak
> Issue Type: Epic
> Components: segment-tar
> Reporter: Michael Dürig
> Assignee: Michael Dürig
> Priority: Major
> Labels: management, monitoring, operations, tooling
>
> h2. Ease of TarMK Operations
> This epic is all about simplifying the operational aspects of the TarMK.
> Broadly this can be broken down into the following three topics.
> h3. Monitoring
> * We need to improve monitoring for system load and health. It should be easy
> for operators to figure out which parts of the TarMK are within safe bounds
> and which are not.
> * Failures should be easy to diagnose and their root cause easy to pinpoint.
> It should be evident if and how a failure can be fixed by the operator.
> h3. Management
> * Management tasks should be easy to use, clear and safe. It should be
> evident how to achieve a certain task, what it means to execute it and what
> its parameters mean (discoverability). Executing a task should not cause harm
> to the system because the system is not in the right state (e.g. running
> restore concurrently to backup should be safe).
> h3. Tooling
> * We need better tooling for diagnosing systems, e.g. analysis of file stores
> (what content, how much content, distribution over space and time,
> reachability, retention time, garbage, etc.), both online and offline (i.e.
> post mortem).
> h2. Individual improvements
> Below is a list of items to address in no specific order. Let's start
> extracting them into individual issues linked to this epic as we start
> tackling them.
> h3. Monitoring
> * Throughput (e.g. time to commit, time to save, etc.)
> * Thrashing (onset thereof)
> * SNFE (transient vs. catastrophic)
> * DSGC
> * FileStore (e.g.
> size on disk, #tar files, #segments, #nodes, #properties, etc.)
> * Cold standby (progress, liveliness, latency, etc.)
> * ...
> h3. Management
> * Revisit backup/restore (OAK-5103, OAK-4866)
> * Coordination of management operations (ability to run conditionally,
> prevent them from running concurrently, etc.)
> h3. Tooling
> * Progress monitor for {{oak-run compact}}
> * Crash recovery for {{oak-run compact}} (e.g. run cleanup only to remove
> garbage left by a prior crash)
> * Bring {{oak-run check}} up to date. Address scalability and performance
> issues. Include more useful statistics (e.g. node types, child node lists,
> content distribution, etc.)
> * Changes over time
> * Consolidation of various (unversioned) scripts into oak-run, like the 'node
> count script' and 'node remove script'.
> * Allow connecting tools to a running instance.
> * Snapshotting support: restartable stats collection (snapshot at a certain
> revision, diff to collect extras)
> * "Friendly" output formats that can be easily used by other tools (e.g. Unix
> tools, Kibana, etc.)
> * Proper usage of stdin and stdout
> * Proper exit codes
> * A current gap in tooling is around the idea of healing a repository plagued
> with SNFEs: bridge the gap between {{oak-run check}} and the 'oak console node
> count script', and provide options to plug the holes to restore the repository
> to a consistent state. One idea would be to complement rolling back the
> segment store to the last good revision with rolling it forward to a new and
> fixed good revision. The simplest way of fixing is to just replace
> unreadable items with empty ones (i.e. "plugging the holes"). From there one
> could diff this new fixed revision against the last good revision to assess
> the damage and see what else needs fixing (e.g. to regain consistency wrt.
> JCR).
> * Classification of tools between development / research / experimental and
> production (customer facing).
> The latter need a different level of support, maintenance, QE,
> documentation, etc. Possibly mark via documentation which is which.
> * Group commands from oak-run in namespaces. Assign a different namespace to
> each persistence implementation in Oak. Let every implementation parse its
> own commands. Move commands closer to their implementation and relieve
> oak-run from code bloat. See OAK-5437 for further details.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)