This is an automated email from the ASF dual-hosted git repository. vatamane pushed a commit to branch reshard in repository https://gitbox.apache.org/repos/asf/couchdb-documentation.git
commit dee7e45859bee6f919f68c767abc785dbb655514 Author: Nick Vatamaniuc <[email protected]> AuthorDate: Tue Mar 19 13:53:01 2019 -0400 Add Shard Splitting RFC --- rfcs/002-shard-splitting.md | 373 ++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 373 insertions(+) diff --git a/rfcs/002-shard-splitting.md b/rfcs/002-shard-splitting.md new file mode 100644 index 0000000..54b0727 --- /dev/null +++ b/rfcs/002-shard-splitting.md @@ -0,0 +1,373 @@ +--- +name: Shard Splitting +about: Introduce Shard Splitting to CouchDB +title: 'Shard Splitting' +labels: rfc, discussion +assignees: '@nickva' + +--- + +# Introduction + +This RFC proposes adding the capability to split shards to CouchDB. The API and +the internals will also allow for other operations on shards in the future such +as merging or rebalancing. + +## Abstract + +Since CouchDB 2.0 clustered databases have had a fixed Q value defined at +creation. This often requires users to predict database usage ahead of time +which can be hard to do. A too low of a value might result in large shards, +slower performance, and needing more disk space to do compactions. + +It would be nice to start with a low Q initially, for example Q=1, and as +usage grows to be able to split some shards that grow too big. Especially +with partitioned queries being available there will be a higher chance +of having uneven sized shards and so it would be beneficial to split the +larger ones to even out the size distribution across the cluster. + +## Requirements Language + +[NOTE]: # ( Do not alter the section below. Follow its instructions. ) + +The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", +"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this +document are to be interpreted as described in +[RFC 2119](https://www.rfc-editor.org/rfc/rfc2119.txt). + +## Terminology + +*resharding* : Manipulating CouchDB shards. Could be splitting, merging, +rebalancing or other operations. This will be used as the top-level API +endpoint name with the idea that in the future different types of shard +manipulation jobs would be added. + +--- + +# Detailed Description + +From the user's perspective there would be a new HTTP API endpoint - +`_reshard/*`. A POST request to `_reshard/jobs/` would start resharding jobs. +Initially these will be of just one "type":"split" but in the future other +types could be added. + +Users would then be able to monitor the state of these jobs to inspect their +progress, see when they completed or failed. + +The API should be designed to be consistent with `_scheduler/jobs` as much as +possible since that is another recent CouchDB's API exposing an internal jobs +list. + +Most of the code implementing this would live in the mem3 application with some +lower level components in the *couch* application. There will be a new child in +the *mem3_sup* supervisor responsible for resharding called *mem3_reshard_sup*. +It will have a *mem3_reshard* manager process which should have an Erlang API +for starting jobs, stopping jobs, removing them, and inspecting their state. +Individual jobs would be instances of a gen_server defined in +*mem3_reshard_job* module. There will be simple-one-for-one supervisor under +*mem3_reshard_sup* named *mem3_reshard_job_sup* to keep track of +*mem3_reshard_job* children . + +An individual shard splitting job will follow roughly these steps in order: + +- **Create targets**. Targets are created. Some target properties should match + the source. This means matching the PSE engine if source uses a custom one. + If source is partitioned, targets should be partitioned as well, etc. + +- **Initial bulk copy.** After the targets are created, copy all the document + in the source shard to the targets. This operation should be as optimized as + possible as it could potentially copy tens of GBs of data. For this reason + this piece of code will be closer the what the compactor does. + +- **Build indices**. The source shard might have had up-to-date indices and so + it is beneficial for the split version to have them as well. Here we'd + inspect all `_design` docs and rebuild all the known indices. After this step + there will be a "topoff" step to replicate any change that might have + occurred on the source while the indices were built. + +- **Update shard map**. Here the global shard map is updated to remove the old + source shard and replace it with the targets. There will be a corresponding + entry added into the shard's document `changelog entry` indicating that a + split happened. To avoid conflicts being generated when multiple copies of a + range finish splitting and race to update the shard map. All shard map + updates will be routes through one consistently picked node (lowest in the + list connected nodes when they are sorted). After shard map is updated. There + will be another topoff replication job to bring in changes from the source + shard to the targets that might have occurred while the shard map was + updating. + +- **Delete source shard** + +This progression of split states will be visible when inspecting a job's status +as well as in the history in the `detail` field of each event. + + +# Advantages and Disadvantages + +Main advantage is to dynamically change shard size distribution on a cluster in +response to changing user requirements without having to delete and recreate +databases. + +One disadvantage is that it might break some basic constraints about all copies +of a shard range being the same size. A user could choose to split for example +a shard copy 00..-ff... on node1 only so on node2 and node3 the copy will be +00-..ff.. but on node1 there will now be 00-..7f.. and 80-ff... External +tooling inspecting $db/_shards endpoint might need to be updated to handle this +scenario. A mitigating factor here is that resharding in the current proposal +is not automatic it is an operation triggered manually by the users. + +# Key Changes + +The main change is the ability to split shard via the `_reshard/*` HTTP API + +## Applications and Modules affected + +Most of the changes will be in the *mem3* application with some changes in the *couch* application as well. + +## HTTP API additions + +`* GET /_reshard` + +Top level summary. Besides the new _reshard endpoint, there `reason` and the stats are more detailed. + +Returns + +``` +{ + "completed": 3, + "failed": 4, + "running": 0, + "state": "stopped", + "state_reason": "Manual rebalancing", + "stopped": 0, + "total": 7 +} +``` + +* `PUT /_reshard/state` + +Start or stop global rebalacing. + +Body +``` +{ + "state": "stopped", + "reason": "Manual rebalancing" +} +``` + +Returns + +``` +{ + "ok": true +} +``` + +* `GET /_reshard/state` + +Return global resharding state and reason. + +``` +{ + "reason": "Manual rebalancing", + "state": "stopped" +} +``` + +* `GET /_reshard/jobs` + +Get the state of all the resharding jobs on the cluster. Now we have a detailed +state transition history which looks similar what _scheduler/jobs have. + +``` +{ + "jobs": [ + { + "history": [ + { + "detail": null, + "timestamp": "2019-02-06T22:28:06Z", + "type": "new" + }, + ... + { + "detail": null, + "timestamp": "2019-02-06T22:28:10Z", + "type": "completed" + } + ], + "id": "001-0a308ef9f7bd24bd4887d6e619682a6d3bb3d0fd94625866c5216ec1167b4e23", + "job_state": "completed", + "node": "[email protected]", + "source": "shards/00000000-ffffffff/db1.1549492084", + "split_state": "completed", + "start_time": "2019-02-06T22:28:06Z", + "state_info": {}, + "target": [ + "shards/00000000-7fffffff/db1.1549492084", + "shards/80000000-ffffffff/db1.1549492084" + ], + "type": "split", + "update_time": "2019-02-06T22:28:10Z" + }, + { + .... + }, + ], + "offset": 0, + "total_rows": 7 +} +``` + +* `POST /_reshard/jobs` + +Create a new resharding job. This can now take other parameters and can split multiple ranges. + +To split one shard on a particular node + +``` +{ + "type": "split", + "shard": "shards/80000000-bfffffff/db1.1549492084" + "node": "[email protected]" +} +``` + +To split a particular range on all nodes: + +``` +{ + "type": "split", + "db" : "db1", + "range" : "80000000-bfffffff" +} +``` + +To split a range on just one node: + +``` +{ + "type": "split", + "db" : "db1", + "range" : "80000000-bfffffff", + "node": "[email protected]" +} +``` + +To split all ranges of a db on one node: + +``` +{ + "type": "split", + "db" : "db1", + "node": "[email protected]" +} +``` + +Result now may contain multiple job IDs + +``` +[ + { + "id": "001-d457a4ea82877a26abbcbcc0e01c4b0070027e72b5bf0c4ff9c89eec2da9e790", + "node": "[email protected]", + "ok": true, + "shard": "shards/80000000-bfffffff/db1.1549986514" + }, + { + "id": "001-7c1d20d2f7ef89f6416448379696a2cc98420e3e7855fdb21537d394dbc9b35f", + "node": "[email protected]", + "ok": true, + "shard": "shards/c0000000-ffffffff/db1.1549986514" + } +] +``` + +* `GET /_reshard/jobs/$jobid` + +Get just one job by its ID + +``` +{ + "history": [ + { + "detail": null, + "timestamp": "2019-02-12T16:55:41Z", + "type": "new" + }, + { + "detail": "Shard splitting disabled", + "timestamp": "2019-02-12T16:55:41Z", + "type": "stopped" + } + ], + "id": "001-d457a4ea82877a26abbcbcc0e01c4b0070027e72b5bf0c4ff9c89eec2da9e790", + "job_state": "stopped", + "node": "[email protected]", + "source": "shards/80000000-bfffffff/db1.1549986514", + "split_state": "new", + "start_time": "1970-01-01T00:00:00Z", + "state_info": { + "reason": "Shard splitting disabled" + }, + "target": [ + "shards/80000000-9fffffff/db1.1549986514", + "shards/a0000000-bfffffff/db1.1549986514" + ], + "type": "split", + "update_time": "2019-02-12T16:55:41Z" +} +``` + +* `GET /_reshard/jobs/$jobid/state` + +Get the running state of a particular job only + +``` +{ + "reason": "Shard splitting disabled", + "state": "stopped" +} +``` + +* `PUT /_reshard/jobs/$jobid/state` + +Stop or resume a particular job + +Request body + +``` +{ + "state": "stopped", + "reason": "Pause this job for now" +} +``` + + +## HTTP API deprecations + +None + +# Security Considerations + +None. + +# References + +Original RFC-as-an-issue: + +https://github.com/apache/couchdb/issues/1920 + +Most of the discussion regarding this has happened on the `@dev` mailing list: + +https://mail-archives.apache.org/mod_mbox/couchdb-dev/201901.mbox/%3CCAJd%3D5Hbs%2BNwrt0%3Dz%2BGN68JPU5yHUea0xGRFtyow79TmjGN-_Sg%40mail.gmail.com%3E + +https://mail-archives.apache.org/mod_mbox/couchdb-dev/201902.mbox/%3CCAJd%3D5HaX12-fk2Lo8OgddQryZaj5KRa1GLN3P9LdYBQ5MT0Xew%40mail.gmail.com%3E + + +# Acknowledgments + +@davisp @kocolosk : Collaborated on the initial idea and design + +@mikerhodes @wohali @janl @iilyak : Additionally collaborated on API design
