nickva commented on a change in pull request #1972: Introduce Shard Splitting To CouchDB
URL: https://github.com/apache/couchdb/pull/1972#discussion_r267927717
 
 

 ##########
 File path: src/mem3/README_reshard.md
 ##########
 @@ -0,0 +1,88 @@
+Developer Oriented Resharding Description
+=========================================
+
+This is a technical description of the resharding logic. The discussion focuses on job creation and life-cycle, data definitions, and the state transition mechanisms.
+
+
+Job Life-Cycle
+--------------
+
+Job creation happens in the `mem3_reshard_httpd` API handler module. That module uses `mem3_reshard_http_util` to do some validation, and eventually calls `mem3_reshard:start_split_job/2` on one or more nodes in the cluster, depending on where the new jobs should run.
+
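+As a quick illustration, kicking off a split from a remote shell on one of the nodes might look roughly like the sketch below, using the `mem3_reshard:start_split_job/1` form mentioned further down. The shard name and the `{ok, JobId}` return shape are assumptions for the example rather than a documented contract.
+
+```erlang
+%% Hedged sketch: start a split job for one source shard on the local node.
+%% The shard name is made up; the return value is assumed to be {ok, JobId}.
+Shard = <<"shards/00000000-7fffffff/mydb.1549492084">>,
+{ok, JobId} = mem3_reshard:start_split_job(Shard).
+```
+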
+`mem3_reshard:start_split_job/2` is the main Erlang API entry point. After some more validation it creates a `#job{}` record and calls the `mem3_reshard` job manager. The manager will then add the job to its jobs ets table, save it to a `_local` document in the shards db, and, most importantly, start a new resharding process. That process will be supervised by the `mem3_reshard_job_sup` supervisor.
+
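+As a rough illustration of that last step, the manager could hand the `#job{}` record to the job supervisor along the lines of the sketch below; the exact child spec and start arguments are assumptions for the example.
+
+```erlang
+%% Hedged sketch: start a job process under mem3_reshard_job_sup, passing
+%% the #job{} record as the child argument. Details may differ from the
+%% actual supervisor configuration.
+{ok, JobPid} = supervisor:start_child(mem3_reshard_job_sup, [Job]).
+```
+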
+Each job runs in a gen_server implemented in the `mem3_reshard_job` module. When splitting a shard, a job will go through a series of steps such as `initial_copy`, `build_indices`, `update_shard_map`, etc. Between each step it will report progress and checkpoint with the `mem3_reshard` manager. A checkpoint involves the `mem3_reshard` manager persisting that job's state to disk in a `_local` document in the _dbs db. The job then continues until it reaches the `completed` state or, if it fails, ends up in the `failed` state.
+
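+The cadence of that loop, reduced to a hedged sketch (every function name below, other than the module names already mentioned, is made up for illustration):
+
+```erlang
+%% Illustrative only: run the current step, report progress and checkpoint
+%% with the mem3_reshard manager, then move on to the next split_state.
+run_steps(#job{split_state = completed} = Job) ->
+    Job;
+run_steps(#job{split_state = State} = Job) ->
+    Job1 = do_step(State, Job),   % e.g. initial_copy, build_indices, ...
+    Job2 = checkpoint(Job1),      % persisted by the manager as a _local doc
+    run_steps(next_step(Job2)).
+```
+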
+If a user stops shard splitting on the whole cluster, then all running jobs 
will stop. When shard splitting is resumed, they will try to recover from their 
last checkpoint.
+
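+Concretely, that cluster-wide toggle is driven per node by the `mem3_reshard:start/0` and `mem3_reshard:stop/1` calls described in the State Transitions section; a hedged usage sketch (the reason text and `ok` return values are assumptions):
+
+```erlang
+%% Illustrative: stop all resharding activity on this node, resume it later.
+ok = mem3_reshard:stop(<<"Planned maintenance">>),
+%% ... later ...
+ok = mem3_reshard:start().
+```
+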
+A job can also be individually stopped or resumed. If a job is individually stopped it will stay stopped even if the global shard splitting state is `running`. A user has to explicitly set that job to a `running` state for it to resume.
+
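+A hedged sketch of what that per-job control could look like from Erlang; the `stop_job/2` and `resume_job/1` names and return values here are assumptions made for illustration, not API confirmed by this document:
+
+```erlang
+%% Illustrative only: stop a single job, then set it running again later.
+ok = mem3_reshard:stop_job(JobId, <<"Pausing just this job">>),
+ok = mem3_reshard:resume_job(JobId).
+```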
+
+Data Definitions
+----------------
+
+This section focuses on record definitions and how data is transformed to and from various formats.
+
+Right below the `mem3_reshard:start_split_job/1` API level a job is converted to a `#job{}` record defined in the `mem3_reshard.hrl` header file. That record is then used throughout most of the resharding code. The job manager `mem3_reshard` stores it in its jobs ets table, and when a job process is spawned its single argument is also just a `#job{}` record. As a job process executes it will periodically report state back to the `mem3_reshard` manager, again as an updated `#job{}` record.
+
+Some interesting fields from the `#job{}` record (a minimal sketch of the record follows this list):
+
+ - `id` Uniquely identifies a job in a cluster. It is derived from the source 
shard name, node and a version (currently = 1).
+ - `type` Currently the only type supported is `split` but `merge` or 
`rebalance` might be added in the future.
+ - `job_state` The running state of the job. Indicates if the job is 
`running`, `stopped`, `completed` or `failed`.
+ - `split_state` Once the job is running this indicates how far along it got 
in the splitting process.
+ - `source` Source shard file. If/when merge is implemented this will be a 
list.
+ - `target` List of target shard files. This is expected to be a list of two items currently.
+ - `history` A time-line of state transitions represented as a list of tuples.
+ - `pid` When the job is running this will be set to the pid of the process.
+
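+Putting the fields above together, a minimal sketch of the record; only the fields discussed here are shown, and the types and defaults are guesses for illustration (the authoritative definition lives in `mem3_reshard.hrl`):
+
+```erlang
+%% Hedged sketch of #job{}; not the actual definition.
+-record(job, {
+    id :: binary() | undefined,
+    type = split :: atom(),
+    job_state :: new | running | stopped | completed | failed | undefined,
+    split_state :: atom(),      % e.g. initial_copy, build_indices, ...
+    source,                     % source shard
+    target = [],                % list of target shards (currently two)
+    history = [] :: [tuple()],  % time-line of state transitions
+    pid :: pid() | undefined
+}).
+```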
+
+Because jobs persist across node restarts, this `#job{}` record also has to be saved to disk. That happens in the `mem3_reshard_job_store` module. There a `#job{}` record is transformed to/from an ejson representation. Those transformation functions are dual-purpose and are also used by the HTTP API to return a job's state and other information to the user.
+
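+For a feel of that representation, the ejson for a running split job might look something like the sketch below; every field name and value here is illustrative rather than taken from the code:
+
+```erlang
+%% Illustrative ejson shape for a job, as stored and as returned over HTTP.
+{[
+    {<<"id">>, <<"001-example-job-id">>},
+    {<<"type">>, <<"split">>},
+    {<<"job_state">>, <<"running">>},
+    {<<"split_state">>, <<"initial_copy">>},
+    {<<"source">>, <<"shards/00000000-7fffffff/db1.1549492084">>},
+    {<<"target">>, [
+        <<"shards/00000000-3fffffff/db1.1549492084">>,
+        <<"shards/40000000-7fffffff/db1.1549492084">>
+    ]}
+]}
+```
+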
+Another important piece of data is the global resharding state. When a user disables resharding on a cluster, a call is made to the `mem3_reshard` manager on each node and they store that in a `#state{}` record. This record is defined in the `mem3_reshard.hrl` header file, and just like the `#job{}` record it can be translated to/from ejson in the `mem3_reshard_store` and stored to and loaded from disk.
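+
+A correspondingly minimal sketch of that record; the field names are guesses for illustration, and the real definition is in `mem3_reshard.hrl`:
+
+```erlang
+%% Hedged sketch: the per-node resharding state kept by the mem3_reshard
+%% manager and persisted as a _local document.
+-record(state, {
+    state = running :: running | stopped,
+    state_info = [] :: [tuple()]    % e.g. the reason passed to stop/1
+}).
+```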
+
+
+State Transitions
+-----------------
+
+Resharding logic has 3 separate states to keep track of:
+
+1. Per-node resharding state. This state can be toggled between `running` and `stopped`. That toggle happens via the `mem3_reshard:start/0` and `mem3_reshard:stop/1` functions. This state is kept in the `#state{}` record of the `mem3_reshard` manager gen_server. This state is also persisted to the local shard map database as a `_local` document so that it is maintained through a node restart. When the state is `running` then all jobs that are not individually `stopped`, and have not failed or completed, will be `running`. When the state is `stopped` all the running jobs will be `stopped`.
+
+2. Job's running state, held in the `#job{}` `job_state` field. This is the general running state of a resharding job. It can be `new`, `running`, `stopped`, `completed` or `failed`. This state is most relevant for the `mem3_reshard` manager; in other words, it is the `mem3_reshard` gen_server that starts the job and monitors it to see whether it exits successfully on completion or with an error.
+
+3. Job's internal splitting state. This state tracks the steps taken during shard splitting by each job. This state is mostly relevant for the `mem3_reshard_job` module. This state is kept in `#job{}`'s `split_state` field. The progression of these states is linear, going from one state to the next. That's reflected in the code as well: when one state is done, `mem3_reshard_job:next_state/1` is called, which returns the next state in the list. The list is defined in the `mem3_reshard.hrl` file in the `SPLIT_STATES` macro (see the sketch after this list). This simplistic transition is also one of the reasons why a gen_fsm wasn't considered for the `mem3_reshard_job` logic.
+
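+As referenced in item 3 above, a hedged sketch of that ordered list and of a linear transition helper; only split states already named in this document are listed, the rest are elided, and the exact shape of `next_state/1` is a guess:
+
+```erlang
+%% Illustrative only; the authoritative list lives in the SPLIT_STATES
+%% macro in mem3_reshard.hrl.
+-define(SPLIT_STATES, [
+    new,
+    initial_copy,
+    build_indices,
+    update_shard_map,
+    %% ... further states elided ...
+    completed
+]).
+
+%% Return the state that follows Current in the list. In this sketch the
+%% final state (completed) has no successor.
+next_state(Current) ->
+    [Current | Rest] = lists:dropwhile(fun(S) -> S =/= Current end, ?SPLIT_STATES),
+    hd(Rest).
+```
+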
+Another interesting aspect is how the `split_state` transitions happen in `mem3_reshard_job`. What follows is an examination of that. A job starts running in the `new` state or from a previously checkpointed state. In the latter case, the job goes through some recovery logic (see the `?STATE_RESTART` macro in `mem3_reshard_job`) where it tries to resume its work from where it left off. This means, for example, that if it was interrupted in the `initial_copy` state it might have to reset the target files and copy everything again. After recovery, the state execution logic is driven by `run(#job{})` which ends up calling the state-specific `?MODULE:State(#job{})` function for each state.
+
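+In other words, the dispatch has roughly this shape; a hedged sketch, since the real `run/1` also has to deal with job stops, checkpoints and errors:
+
+```erlang
+%% Illustrative: each split_state maps to an exported, same-named function
+%% in this module, e.g. initial_copy/1 or build_indices/1.
+run(#job{split_state = State} = Job) ->
+    ?MODULE:State(Job).
+```
+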
+In `mem3_reshard_job:switch_state/2` the job's history is updated, any current `state_info` is cleared, the job state is switched in the `#job{}` record, and then the `mem3_reshard` manager is asked to checkpoint this state. That notification is done via a gen_server cast message. After that is sent the job process sits and waits. In the meantime the `mem3_reshard` manager checkpoints, which means it updates both its ets table with the new `#job{}` record and also persists the state with the `mem3_reshard_store`, and finally it notifies the job process that checkpointing is done by calling `mem3_reshard_job:checkpoint_done/1`.
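+
+Reduced to a sketch, the handshake described above might look like this on both sides; the message shape and the helper names are made up for illustration, with only `mem3_reshard_job:checkpoint_done/1` named in the text:
+
+```erlang
+%% In mem3_reshard_job (illustrative): ask the manager to checkpoint and
+%% then sit and wait until checkpoint_done/1 is called back.
+request_checkpoint(#job{} = Job) ->
+    gen_server:cast(mem3_reshard, {checkpoint, Job, self()}).
+
+%% In mem3_reshard (illustrative): update the jobs ets table, persist the
+%% job with the store module, then let the job process continue.
+handle_cast({checkpoint, Job, JobPid}, State) ->
+    true = ets:insert(mem3_reshard_jobs, Job),     % hypothetical table name
+    ok = persist_job(Job),                         % hypothetical store wrapper
+    ok = mem3_reshard_job:checkpoint_done(JobPid),
+    {noreply, State}.
+```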
 
 Review comment:
   Sent is correct, silly me. Thanks :-)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services
