nickva commented on a change in pull request #1972: Introduce Shard Splitting To CouchDB
URL: https://github.com/apache/couchdb/pull/1972#discussion_r268806817
##########
File path: src/mem3/README_reshard.md
##########
@@ -0,0 +1,93 @@

Developer Oriented Resharding Description
=========================================

This is a technical description of the resharding logic. The discussion will focus on: job creation and life-cycle, data definitions, and the state transition mechanisms.


Job Life-Cycle
--------------

Job creation happens in the `mem3_reshard_httpd` API handler module. That module uses `mem3_reshard_http_util` to do some validation, and eventually calls `mem3_reshard:start_split_job/2` on one or more nodes in the cluster, depending on where the new jobs should run.

`mem3_reshard:start_split_job/2` is the main Erlang API entry point. After some more validation, it creates a `#job{}` record and calls the `mem3_reshard` job manager. The manager will then add the job to its jobs ets table, save it to a `_local` document in the shards db, and, most importantly, start a new resharding process. That process will be supervised by the `mem3_reshard_job_sup` supervisor.

Each job runs in a gen_server implemented in the `mem3_reshard_job` module. When splitting a shard, a job will go through a series of steps such as `initial_copy`, `build_indices`, `update_shard_map`, etc. Between each step it will report progress and checkpoint with the `mem3_reshard` manager (a schematic sketch of this loop appears near the end of this document). A checkpoint involves the `mem3_reshard` manager persisting that job's state to disk in a `_local` document in the `_dbs` db. The job then continues until it reaches the `completed` state, or fails and ends up in the `failed` state.

If a user stops shard splitting on the whole cluster, all running jobs will stop. When shard splitting is resumed, they will try to recover from their last checkpoint.

A job can also be individually stopped or resumed. If a job is individually stopped, it will stay stopped even if the global shard splitting state is `running`. A user has to explicitly set that job to a `running` state for it to resume. If a node with running jobs is turned off, the jobs will resume from their last checkpoint when the node is rebooted.


Data Definitions
----------------

This section focuses on record definitions and how data is transformed to and from various formats.

Right below the `mem3_reshard:start_split_job/1` API level, a job is converted to a `#job{}` record defined in the `mem3_reshard.hrl` header file. That record is then used throughout most of the resharding code. The job manager `mem3_reshard` stores it in its jobs ets table, and when a job process is spawned, its single argument is also just a `#job{}` record. As a job process executes, it will periodically report state back to the `mem3_reshard` manager as an updated `#job{}` record.

Some interesting fields from the `#job{}` record (a sketch of the record follows this list):

 - `id` Uniquely identifies a job in a cluster. It is derived from the source shard name, the node, and a version (currently = 1).
 - `type` Currently the only type supported is `split`, but `merge` or `rebalance` might be added in the future.
 - `job_state` The running state of the job. Indicates whether the job is `running`, `stopped`, `completed` or `failed`.
 - `split_state` Once the job is running, this indicates how far along it has gotten in the splitting process.
 - `source` Source shard file. If/when merge is implemented, this will be a list.
 - `target` List of target shard files. This is currently expected to be a list of 2 items.
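To make the field list concrete, here is a minimal sketch of what the record could look like. It is illustrative only: the authoritative definition lives in `mem3_reshard.hrl`, and the type specs and defaults below are assumptions inferred from the field descriptions above.

```erlang
%% Illustrative sketch only; the authoritative definition is in
%% mem3_reshard.hrl. Type specs and defaults are assumptions inferred
%% from the field descriptions above.
-record(job, {
    id :: binary() | undefined,      % unique across the cluster
    type = split :: split,           % `merge`/`rebalance` may be added later
    job_state :: running | stopped | completed | failed | undefined,
    split_state = new :: atom(),     % e.g. initial_copy, build_indices, ...
    source :: binary() | undefined,  % source shard file
    target = [] :: [binary()]        % target shard files (currently 2 items)
}).
```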
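The `id` derivation could, for instance, hash the version, the node, and the source shard name together into a stable identifier. The function below is a hypothetical sketch, not the actual `mem3_reshard` encoding:

```erlang
%% Hypothetical id derivation: hash the {version, node, shard} triple and
%% hex-encode it. The real encoding in mem3_reshard may differ.
-define(JOB_ID_VERSION, 1).

job_id(Node, SourceShardName) when is_atom(Node), is_binary(SourceShardName) ->
    Material = term_to_binary({?JOB_ID_VERSION, Node, SourceShardName}),
    binary:encode_hex(crypto:hash(sha256, Material)).  % encode_hex: OTP 24+
```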
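Finally, the step-then-checkpoint loop from the Job Life-Cycle section can be pictured roughly as below. `run_step/2` and `checkpoint/1` are placeholder names for this sketch, not the actual `mem3_reshard_job` internals:

```erlang
%% Schematic only: walk the split states in order, checkpointing with the
%% manager after each step. run_step/2 and checkpoint/1 are placeholders.
run_steps(Job, []) ->
    {ok, Job#job{job_state = completed}};
run_steps(Job0, [Step | Rest]) ->
    case run_step(Step, Job0) of
        {ok, Job1} ->
            Job2 = Job1#job{split_state = Step},
            ok = checkpoint(Job2),  % manager persists a _local doc, then we go on
            run_steps(Job2, Rest);
        {error, Reason} ->
            {error, Job0#job{job_state = failed}, Reason}
    end.
```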
Review comment:

Thanks!