flimzy commented on a change in pull request #268: Rewrite sharding documentation
URL: 
https://github.com/apache/couchdb-documentation/pull/268#discussion_r203290115
 
 

 ##########
 File path: src/cluster/sharding.rst
 ##########
 @@ -12,290 +12,490 @@
 
 .. _cluster/sharding:
 
-========
-Sharding
-========
+================
+Shard Management
+================
 
 .. _cluster/sharding/scaling-out:
 
-Scaling out
-===========
+Introduction
+------------
 
-Normally you start small and grow over time. In the beginning you might do just
-fine with one node, but as your data and number of clients grows, you need to
-scale out.
+A `shard
+<https://en.wikipedia.org/wiki/Shard_(database_architecture)>`__ is a
+horizontal partition of data in a database. Partitioning data into
+shards and distributing copies of each shard (called "shard replicas" or
+just "replicas") to different nodes in a cluster gives the data greater
+durability against node loss. CouchDB clusters automatically shard
+databases and distribute the subsets of documents that compose each
+shard among nodes. Modifying cluster membership and sharding behavior
+must be done manually.
 
-For simplicity we will start fresh and small.
+Shards and Replicas
+~~~~~~~~~~~~~~~~~~~
 
-Start ``node1`` and add a database to it. To keep it simple we will have 2
-shards and no replicas.
+How many shards and replicas each database has can be set at the global
+level, or on a per-database basis. The relevant parameters are *q* and
+*n*.
 
-.. code-block:: bash
+*q* is the number of database shards to maintain. *n* is the number of
+copies of each document to distribute. The default value for *n* is 3,
+and for *q* is 8. With q=8, the database is split into 8 shards. With
+n=3, the cluster distributes three replicas of each shard. Altogether,
+that's 24 shard replicas for a single database. In a default 3-node
+cluster, each node would receive 8 shard replicas. In a 4-node cluster,
+each node would receive 6. We recommend in the general case that the
+number of nodes in your cluster should be a multiple of *n*, so that
+shards are distributed evenly.
 
-    curl -X PUT "http://xxx.xxx.xxx.xxx:5984/small?n=1&q=2"; --user daboss
+CouchDB nodes have an ``etc/default.ini`` file with a section named
+``[cluster]`` which looks like this:
 
-If you look in the directory ``data/shards`` you will find the 2 shards.
+::
 
-.. code-block:: text
+    [cluster]
+    q=8
+    n=3
 
-    data/
-    +-- shards/
-    |   +-- 00000000-7fffffff/
-    |   |    -- small.1425202577.couch
-    |   +-- 80000000-ffffffff/
-    |        -- small.1425202577.couch
+These settings can be modified to set sharding defaults for all
+databases, or they can be set on a per-database basis by specifying the
+``q`` and ``n`` query parameters when the database is created. For
+example:
 
-Now, check the node-local ``_dbs_`` database. Here, the metadata for each
-database is stored. As the database is called ``small``, there is a document
-called ``small`` there. Let us look in it. Yes, you can get it with curl too:
+.. code:: bash
 
-.. code-block:: javascript
+    $ curl -X PUT "$COUCH_URL/database-name?q=4&n=2"
 
-    curl -X GET "http://xxx.xxx.xxx.xxx:5986/_dbs/small";
+That creates a database that is split into 4 shards, with 2 replicas of
+each shard, yielding 8 shard replicas distributed throughout the
+cluster.
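+
+If you want to double-check the values a database was created with, a
+``GET`` on the database returns a ``cluster`` object containing them.
+The response below is abridged and illustrative; the exact fields may
+vary by CouchDB version:
+
+.. code:: bash
+
+    $ curl -s "$COUCH_URL/database-name"
+    {"db_name":"database-name", ..., "cluster":{"q":4,"n":2,"w":2,"r":2}, ...}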
 
+Quorum
+~~~~~~
+
+Depending on the size of the cluster, the number of shards per database,
+and the number of shard replicas, not every node may have access to
+every shard, but every node knows where all the replicas of each shard
+can be found through CouchDB's internal shard map.
+
+Each request that comes in to a CouchDB cluster is handled by any one
+random coordinating node. This coordinating node proxies the request to
+the other nodes that have the relevant data, which may or may not
+include itself. The coordinating node sends a response to the client
+once a `quorum
+<https://en.wikipedia.org/wiki/Quorum_(distributed_computing)>`__ of
+database nodes have responded; 2, by default.
+
+The size of the required quorum can be configured at
+request time by setting the ``r`` parameter for document and view
+reads, and the ``w`` parameter for document writes. For example, here
+is a request that specifies that at least two nodes must respond in
+order to establish a quorum:
+
+.. code:: bash
+
+    $ curl "$COUCH_URL:5984/<db>/<doc>?r=2"
+
+Here is a similar example for writing a document:
+
+.. code:: bash
+
+    $ curl -X PUT "$COUCH_URL:5984/<db>/<doc>?w=2" -d '{...}'
+
+Setting ``r`` or ``w`` to be equal to ``n`` (the number of replicas)
+means you will only receive a response once all nodes with relevant
+shards have responded or timed out, and as such this approach does not
+guarantee `ACIDic consistency
+<https://en.wikipedia.org/wiki/ACID#Consistency>`__. Setting ``r`` or
+``w`` to 1 means you will receive a response after only one relevant
+node has responded.
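+
+For example, a write that is acknowledged as soon as a single replica
+reports success (a sketch that trades durability guarantees for lower
+latency) would look like this:
+
+.. code:: bash
+
+    $ curl -X PUT "$COUCH_URL:5984/<db>/<doc>?w=1" -d '{...}'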
+
+.. _cluster/sharding/move:
+
+Moving a shard
+--------------
+
+This section describes how to manually place and replace shards, and
+how to set up placement rules to assign shards to specific nodes. These
+activities are critical steps when you determine that your cluster is
+too big or too small and want to resize it successfully, or when you
+have noticed from server metrics that the database/shard layout is
+suboptimal and you have "hot spots" that need resolving.
+
+Consider a three node cluster with q=8 and n=3. Each database has 24
+shards, distributed across the three nodes. If you add a fourth node to
+the cluster, CouchDB will not redistribute existing database shards to
+it. This leads to unbalanced load, as the new node will only host shards
+for databases created after it joined the cluster. To balance the
+distribution of shards from existing databases, they must be moved
+manually.
+
+Moving shards between nodes in a cluster involves the following steps:
+
+1. Copy the shard(s) and any secondary index shard(s) onto the target node.
+2. Set the target node to maintenance mode.
+3. Update cluster metadata to reflect the new target shard(s).
+4. Monitor internal replication to ensure up-to-date shard(s).
+5. Clear the target node's maintenance mode.
+6. Update cluster metadata again to remove the source shard(s).
+7. Remove the shard file(s) and secondary index file(s) from the source node.
+
+Copying shard files
+~~~~~~~~~~~~~~~~~~~
+
+Shard files live in the ``data/shards`` directory of your CouchDB
+install, in subdirectories named after each shard range. For instance,
+here are the database shard files for a q=8 database called ``abc``:
+
+::
+
+  data/shards/00000000-1fffffff/abc.1529362187.couch
+  data/shards/20000000-3fffffff/abc.1529362187.couch
+  data/shards/40000000-5fffffff/abc.1529362187.couch
+  data/shards/60000000-7fffffff/abc.1529362187.couch
+  data/shards/80000000-9fffffff/abc.1529362187.couch
+  data/shards/a0000000-bfffffff/abc.1529362187.couch
+  data/shards/c0000000-dfffffff/abc.1529362187.couch
+  data/shards/e0000000-ffffffff/abc.1529362187.couch
+
+Since they are just files, you can use ``cp``, ``rsync``, ``scp``, or
+any other file-copying tool to move them from one node to another. For
+example:
+
+.. code:: bash
+
+    # on one machine
+    $ mkdir -p data/shards/<range>
+    # on the other
+    $ scp <couch-dir>/data/shards/<range>/<database>.<datecode>.couch \
+        <node>:<couch-dir>/data/shards/<range>/
+
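+If ``rsync`` is installed on both machines, an equivalent copy, sketched
+here with the same placeholder paths as above, could be:
+
+.. code:: bash
+
+    # run from the source node; paths are the same placeholders as above
+    $ rsync -av <couch-dir>/data/shards/<range>/<database>.<datecode>.couch \
+        <node>:<couch-dir>/data/shards/<range>/
+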
+Secondary indexes (including JavaScript views, Erlang views and Mango
+indexes) are also sharded, and their shards should be moved to save the
+new node the effort of rebuilding the view. View shards live in
+``data/.shards``. For example:
+
+::
+
+  data/.shards
+  data/.shards/e0000000-ffffffff/_replicator.1518451591_design
+  data/.shards/e0000000-ffffffff/_replicator.1518451591_design/mrview
+  data/.shards/e0000000-ffffffff/_replicator.1518451591_design/mrview/3e823c2a4383ac0c18d4e574135a5b08.view
+  data/.shards/c0000000-dfffffff
+  data/.shards/c0000000-dfffffff/_replicator.1518451591_design
+  data/.shards/c0000000-dfffffff/_replicator.1518451591_design/mrview
+  data/.shards/c0000000-dfffffff/_replicator.1518451591_design/mrview/3e823c2a4383ac0c18d4e574135a5b08.view
+  ...
+
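+These view shards can be copied to the target node in the same way as
+the database shards; for example, with ``rsync`` (again a sketch using
+placeholder paths):
+
+.. code:: bash
+
+    # copy all view shards for one range to the same range on the target
+    $ rsync -av <couch-dir>/data/.shards/<range>/ \
+        <node>:<couch-dir>/data/.shards/<range>/
+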
+.. warning::
+    Technically, copying database and secondary index shards is
+    optional. If you proceed to the next step without performing this
+    data copy, CouchDB will use internal replication to populate the
+    newly added shard replicas. However, this process can be very slow,
+    especially on a busy cluster, which is why we recommend performing
+    this manual data copy first.
+
+Set the target node to ``true`` maintenance mode
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Before we tell CouchDB about the new shards on the node in question, we
+need to set the node to ``true`` maintenance mode. This special mode
+instructs CouchDB to return a ``404 Not Found`` response on the ``/_up``
+endpoint, and ensures it does not participate in normal interactive
+clustered requests for its shards. A properly configured load balancer
+that uses ``GET /_up`` to check the health of nodes will detect this 404
+and remove the node from its pool of backend targets, preventing any
+HTTP requests from being sent to that node. An example HAProxy
+configuration to use the ``/_up`` endpoint is as follows:
+
+::
+
+  http-check disable-on-404
+  option httpchk GET /_up
+
+If you do not set maintenance mode, or the load balancer ignores this
+maintenance mode status, after the next step is performed the cluster
+may return incorrect responses when consulting the node in question. You
+don't want this! In the next steps, we will ensure that this shard is
+up-to-date before allowing it to participate in end-user requests.
+
+To set true maintenance mode:
+
+.. code:: bash
+
+    $ curl -X PUT -H "Content-type: application/json" \
+        http://localhost:5984/_node/<nodename>/_config/couchdb/maintenance_mode \
+        -d "\"true\""
+
+Then, verify that the node is in maintenance mode by performing a ``GET
+/_up`` on that node's individual endpoint:
+
+.. code:: bash
+
+    $ curl -v http://localhost:5984/_up
+    …
+    < HTTP/1.1 404 Object Not Found
+    …
+    {"status":"maintenance_mode"}
+
+Finally, check that your load balancer has removed the node from the
+pool of available backend nodes.
+
+Updating cluster metadata to reflect the move
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Now we need to tell CouchDB that the target node (which must already be
+joined to the cluster) should be hosting shard replicas for a given
+database.
+
+To update the cluster metadata, use the special ``/_dbs`` database,
+which is an internal CouchDB database that maps databases to shards and
+nodes. It is accessible only via a node's private port, usually at port
+5986. By default, this port is only available on the localhost interface
+for security purposes.
+
+First, retrieve the database's current metadata:
+
+.. code:: bash
+
+    $ curl localhost:5986/_dbs/{name}
     {
-        "_id": "small",
-        "_rev": "1-5e2d10c29c70d3869fb7a1fd3a827a64",
-        "shard_suffix": [
-            46,
-            49,
-            52,
-            50,
-            53,
-            50,
-            48,
-            50,
-            53,
-            55,
-            55
+      "_id": "{name}",
+      "_rev": "1-e13fb7e79af3b3107ed62925058bfa3a",
+      "shard_suffix": [46, 49, 53, 51, 48, 50, 51, 50, 53, 50, 54],
+      "changelog": [
+        ["add", "00000000-1fffffff", "no...@xxx.xxx.xxx.xxx"],
+        ["add", "00000000-1fffffff", "no...@xxx.xxx.xxx.xxx"],
+        ["add", "00000000-1fffffff", "no...@xxx.xxx.xxx.xxx"],
+        …
+      ],
+      "by_node": {
+        "no...@xxx.xxx.xxx.xxx": [
+          "00000000-1fffffff",
+          …
         ],
-        "changelog": [
-        [
-            "add",
-            "00000000-7fffffff",
-            "no...@xxx.xxx.xxx.xxx"
+        …
+      },
+      "by_range": {
+        "00000000-1fffffff": [
+          "no...@xxx.xxx.xxx.xxx",
+          "no...@xxx.xxx.xxx.xxx",
+          "no...@xxx.xxx.xxx.xxx"
         ],
-        [
-            "add",
-            "80000000-ffffffff",
-            "no...@xxx.xxx.xxx.xxx"
-        ]
-        ],
-        "by_node": {
-            "no...@xxx.xxx.xxx.xxx": [
-                "00000000-7fffffff",
-                "80000000-ffffffff"
-            ]
-        },
-        "by_range": {
-            "00000000-7fffffff": [
-                "no...@xxx.xxx.xxx.xxx"
-            ],
-            "80000000-ffffffff": [
-                "no...@xxx.xxx.xxx.xxx"
-            ]
-        }
+        …
+      }
     }
 
-* ``_id`` The name of the database.
-* ``_rev`` The current revision of the metadata.
-* ``shard_suffix`` The numbers after small and before .couch. This is seconds
-  after UNIX epoch when the database was created. Stored as ASCII characters.
-* ``changelog`` Self explaining. Mostly used for debugging.
-* ``by_node`` List of shards on each node.
-* ``by_range`` On which nodes each shard is.
+Here is a brief anatomy of that document:
+
+-  ``_id``: The name of the database.
+-  ``_rev``: The current revision of the metadata.
+-  ``shard_suffix``: A timestamp of the database's creation, expressed
+   as seconds since the Unix epoch and encoded as the ASCII codepoints
+   of its decimal digits.
+-  ``changelog``: History of the database's shards.
+-  ``by_node``: List of shards on each node.
+-  ``by_range``: On which nodes each shard is.
+
+To reflect the shard move in the metadata, there are three steps:
 
-Nothing here, nothing there, a shard in my sleeve
--------------------------------------------------
+1. Add appropriate changelog entries.
+2. Update the ``by_node`` entries.
+3. Update the ``by_range`` entries.
 
-Start node2 and add it to the cluster. Check in ``/_membership`` that the
-nodes are talking with each other.
+As of this writing, this process must be done manually. **WARNING: Be
+very careful! Mistakes during this process can irreparably corrupt the
+cluster!**
 
-If you look in the directory ``data`` on node2, you will see that there is no
-directory called shards.
+To add a shard to a node, add entries like this to the database
+metadata's ``changelog`` attribute:
 
-Use curl to change the ``_dbs/small`` node-local document for small, so it
-looks like this:
+.. code:: json
 
-.. code-block:: javascript
+    ["add", "<range>", "<node-name>"]
+
+The ``<range>`` is the specific shard range for the shard. The
+``<node-name>`` should match the name and address of the node as
+displayed in ``GET /_membership`` on the cluster.
+
+.. warning::
+    When removing a shard from a node, specify ``remove`` instead of
+    ``add``.
+
+Once you have figured out the new changelog entries, you will need to
+update the ``by_node`` and ``by_range`` attributes to reflect who is
+storing what shards. The data in the changelog entries and these
+attributes must match. If they do not, the database may become
+corrupted.
+
+Continuing our example, here is an updated version of the metadata above
+that adds shards to an additional node called ``node4``:
+
+.. code:: json
 
     {
-        "_id": "small",
-        "_rev": "1-5e2d10c29c70d3869fb7a1fd3a827a64",
-        "shard_suffix": [
-            46,
-            49,
-            52,
-            50,
-            53,
-            50,
-            48,
-            50,
-            53,
-            55,
-            55
-        ],
-        "changelog": [
-        [
-            "add",
-            "00000000-7fffffff",
-            "no...@xxx.xxx.xxx.xxx"
-        ],
-        [
-            "add",
-            "80000000-ffffffff",
-            "no...@xxx.xxx.xxx.xxx"
-        ],
-        [
-            "add",
-            "00000000-7fffffff",
-            "no...@yyy.yyy.yyy.yyy"
+      "_id": "{name}",
+      "_rev": "1-e13fb7e79af3b3107ed62925058bfa3a",
+      "shard_suffix": [46, 49, 53, 51, 48, 50, 51, 50, 53, 50, 54],
+      "changelog": [
+        ["add", "00000000-1fffffff", "no...@xxx.xxx.xxx.xxx"],
+        ["add", "00000000-1fffffff", "no...@xxx.xxx.xxx.xxx"],
+        ["add", "00000000-1fffffff", "no...@xxx.xxx.xxx.xxx"],
+        …
+        ["add", "00000000-1fffffff", "no...@xxx.xxx.xxx.xxx"]
+      ],
+      "by_node": {
+        "no...@xxx.xxx.xxx.xxx": [
+          "00000000-1fffffff",
+          …
         ],
-        [
-            "add",
-            "80000000-ffffffff",
-            "no...@yyy.yyy.yyy.yyy"
+        …
+        "no...@xxx.xxx.xxx.xxx": [
+          "00000000-1fffffff"
         ]
+      },
+      "by_range": {
+        "00000000-1fffffff": [
+          "no...@xxx.xxx.xxx.xxx",
+          "no...@xxx.xxx.xxx.xxx",
+          "no...@xxx.xxx.xxx.xxx",
+          "no...@xxx.xxx.xxx.xxx"
         ],
-        "by_node": {
-            "no...@xxx.xxx.xxx.xxx": [
-                "00000000-7fffffff",
-                "80000000-ffffffff"
-            ],
-            "no...@yyy.yyy.yyy.yyy": [
-                "00000000-7fffffff",
-                "80000000-ffffffff"
-            ]
-        },
-        "by_range": {
-            "00000000-7fffffff": [
-                "no...@xxx.xxx.xxx.xxx",
-                "no...@yyy.yyy.yyy.yyy"
-            ],
-            "80000000-ffffffff": [
-                "no...@xxx.xxx.xxx.xxx",
-                "no...@yyy.yyy.yyy.yyy"
-            ]
-        }
+        …
+      }
     }
 
-After PUTting this document, it's like magic: the shards are now on node2 too!
-We now have ``n=2``!
+Now you can ``PUT`` this new metadata:
 
-If the shards are large, then you can copy them over manually and only have
-CouchDB sync the changes from the last minutes instead.
+.. code:: bash
 
-.. _cluster/sharding/move:
+    $ curl -X PUT $COUCH_URL:5986/_dbs/{name} -d '{...}'
+
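+To avoid shell quoting issues with a large JSON body, you can save the
+edited metadata to a file first and have ``curl`` read it from there
+(``metadata.json`` is just an example filename):
+
+.. code:: bash
+
+    # metadata.json holds the edited metadata document from above
+    $ curl -X PUT "$COUCH_URL:5986/_dbs/{name}" \
+        -H "Content-Type: application/json" \
+        -d @metadata.json
+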
+Monitor internal replication to ensure up-to-date shard(s)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+After you complete the previous step, as soon as CouchDB receives a
+write request for a shard on the target node, CouchDB will check if the
+target node's shard(s) are up to date. If it finds they are not up to
+date, it will trigger an internal replication job to complete this task.
+You can observe this happening by triggering a write to the database
+(update a document, or create a new one), while monitoring the
+``/_node/<nodename>/_system`` endpoint, which includes the
+``internal_replication_jobs`` metric.
+
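+As a rough sketch, assuming ``jq`` is installed and that the metric
+appears at the top level of the ``_system`` response, you could check
+it like this:
+
+.. code:: bash
+
+    # inspect the internal replication backlog on the target node
+    $ curl -s "$COUCH_URL:5984/_node/<nodename>/_system" \
+        | jq .internal_replication_jobs
+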
+Once this metric has returned to the baseline from before you wrote the
+document, or is zero (0., it is safe to proceed.
 
 Review comment:
   I think this should be "zero (0)"?  Or maybe we can simplify to:
   
       document, or is ``0``, it is safe to proceed.
