Okay. Please let me know which Riak config parameters (or other
settings) you think could make the recovery faster. For example, the
transfer limit, which can be changed using the riak-admin
transfer-limit command.
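For instance, I am thinking of raising the handoff concurrency from
its default of 2, roughly as below (the node name is made up, and I
have not settled on a value yet):

    # raise the transfer limit cluster-wide
    riak-admin transfer-limit 8
    # or on a single node only
    riak-admin transfer-limit riak@192.168.1.10 8

Would that be sensible, or are there other knobs worth tuning?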
Thanks,
Leo

On Tue, Sep 19, 2017 at 2:23 PM, Bryan Hunt
<bryan.h...@erlang-solutions.com> wrote:
> Sorry Leo,
>
> That’s completely impossible to guess :-D
>
> Factors include: I/O, network cards, network switch, selinux, block
> size, CPU, size of objects, number of objects, CRDTs, Riak version,
> etc…
>
> Best,
>
> Bryan
>
>> On 19 Sep 2017, at 18:53, Leo <scicompl...@gmail.com> wrote:
>>
>> Dear Bryan,
>>
>> Thank you very much for your answers. They are very helpful to me.
>> I will use more nodes (>= 5) in the future.
>>
>> From your experience with Riak, what would your guess be for the
>> time taken to finish all the AAE transfers and complete the recovery
>> of about 1 TB of data (assuming my cluster is otherwise completely
>> idle, with no users accessing it during this process, and that I am
>> continuously watching the transfers and gradually re-enabling
>> disabled AAE trees)? I am just asking for a rough estimate from your
>> past experience (please quote your experience with a differently
>> sized cluster / data size too). My guess is that it will take
>> approx. 2 days or more. Do you concur?
>>
>> Thanks,
>> Leo
>>
>> On Tue, Sep 19, 2017 at 12:41 PM, Bryan Hunt
>> <bryan.h...@erlang-solutions.com> wrote:
>>> (0) Three nodes are insufficient; you should have five nodes.
>>> (1) You could iterate and read every object in the cluster - this
>>> would also trigger read repair for every object (see the sketch at
>>> the end of this mail).
>>> (2) (Copied from Engel Sanchez's response to a similar question,
>>> April 10th 2014:)
>>>
>>> * If AAE is disabled, you don't have to stop the node to delete the
>>> data in the anti_entropy directories.
>>> * If AAE is enabled, deleting the AAE data in a rolling manner may
>>> trigger an avalanche of read repairs between nodes with the bad
>>> trees and nodes with good trees as the data seems to diverge.
>>>
>>> If your nodes are already up, with AAE enabled and with old
>>> incorrect trees in the mix, there is a better way. You can
>>> dynamically disable AAE with some console commands. At that point,
>>> without stopping the nodes, you can delete all AAE data across the
>>> cluster. At a convenient time, re-enable AAE. I say convenient
>>> because all trees will start to rebuild, and that can be
>>> problematic in an overloaded cluster. Doing this over the weekend
>>> might be a good idea unless your cluster can take the extra load.
>>>
>>> To dynamically disable AAE from the Riak console, you can run this
>>> command:
>>>
>>>> riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, disable, [], 60000).
>>>
>>> and re-enable it with the similar:
>>>
>>>> riak_core_util:rpc_every_member_ann(riak_kv_entropy_manager, enable, [], 60000).
>>>
>>> That last number is just a timeout for the RPC operation. I hope
>>> this saves you some extra load on your clusters. (There is an
>>> example of clearing the trees at the end of this mail.)
>>>
>>> (3) That's going to be:
>>> (3a) List all keys using the client of your choice
>>> (3b) Fetch each object
>>>
>>> https://www.tiot.jp/riak-docs/riak/kv/2.2.3/developing/usage/reading-objects/
>>> https://www.tiot.jp/riak-docs/riak/kv/2.2.3/developing/usage/secondary-indexes/
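>>>
>>> To make (1)/(3) concrete, here is a minimal sketch from riak attach.
>>> It is untested, assumes the internal riak_client API of Riak KV 2.x,
>>> and uses a made-up bucket name, so adapt it to your key space:
>>>
>>> {ok, C} = riak:local_client(),
>>> {ok, Keys} = C:list_keys(<<"my_bucket">>),
>>> %% each default-quorum GET read-repairs any divergent replicas
>>> lists:foreach(fun(K) -> C:get(<<"my_bucket">>, K) end, Keys).
>>>
>>> Bear in mind that list_keys is expensive; with 1 TB of data, run it
>>> bucket by bucket and off-peak.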
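>>>
>>> For (2), once AAE is disabled, clearing the trees is just removing
>>> the anti_entropy data on each node. The path below assumes the
>>> default data directory of a packaged install, so check yours first:
>>>
>>> rm -rf /var/lib/riak/anti_entropy/*
>>>
>>> After you re-enable AAE, you can watch the rebuilds from riak attach
>>> with:
>>>
>>> riak_kv_entropy_info:compute_tree_info().
>>>
>>> which shows when each partition's hash tree was last built.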
>>>
>>> On 19 Sep 2017, at 18:31, Leo <scicompl...@gmail.com> wrote:
>>>
>>> Dear Riak users and experts,
>>>
>>> I really appreciate any help with my questions below.
>>>
>>> I have a 3-node Riak cluster, each node with approx. 1 TB of disk
>>> usage. All of a sudden, one node's hard disk failed unrecoverably.
>>> So, I added a new node using the following steps:
>>>
>>> 1) riak-admin cluster join
>>> 2) down the failed node
>>> 3) riak-admin cluster force-replace failed-node new-node
>>> 4) riak-admin cluster plan
>>> 5) riak-admin cluster commit
>>>
>>> This almost fixed the problem, except that after lots of data
>>> transfers and handoffs, not all three nodes have 1 TB of disk usage
>>> any more. Only two of them do; the third is almost empty (a few
>>> tens of GBs). This means there are no longer 3 copies on disk. My
>>> data is completely random (no two keys have the same data
>>> associated with them, so compression cannot be the reason for less
>>> data on disk).
>>>
>>> I also tried the "riak-admin cluster replace failednode newnode"
>>> command, so that the leaving node hands off its data to the joining
>>> node. This, however, does not help when the leaving node has a
>>> failed hard disk. I want the remaining live vnodes to help the new
>>> node recreate the lost data from their replica copies.
>>>
>>> I have three questions:
>>>
>>> 1) What commands should I run to forcefully make sure there are
>>> three replicas on disk overall, without waiting for read repair or
>>> anti-entropy to make the copies? Bandwidth or CPU usage is not a
>>> huge concern for me.
>>>
>>> 2) I would also be very grateful if someone could list the commands
>>> I can run via "riak attach" to clear the AAE trees and forcefully
>>> make sure all data has 3 copies.
>>>
>>> 3) I would be very thankful if someone could share the commands to
>>> verify that all data has 3 replicas on disk after the disk failure
>>> (instead of just using the disk space usage on the nodes as a
>>> hint).
>>>
>>> Thanks,
>>> Leo
>
> _______________________________________________
> riak-users mailing list
> riak-users@lists.basho.com
> http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com