On Wed, Mar 21, 2012 at 3:24 PM, Tom Wilkie <t...@acunu.com> wrote:
> Hi Edward
>
>> 1) No more RAID 0. If a machine is responsible for 4 vnodes, they
>> should correspond to four JBOD disks.
>
> So each vnode corresponds to a disk? I suppose we could have a
> separate data directory per disk, but I think this should be a
> separate, subsequent change.

I think having more micro-ranges makes the process much easier.
Imagine a token ring 0-30:

Node1 | major range 0-10  | disk 1 0-2,   disk 2 3-4,   disk 3 5-7,   disk 4 8-10
Node2 | major range 11-20 | disk 1 11-12, disk 2 13-14, disk 3 15-17, disk 4 18-20
Node3 | major range 21-30 | disk 1 21-22, disk 2 23-24, disk 3 25-27, disk 4 28-30
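Roughly, the per-disk assignment could look like this (a toy Java
sketch, not actual Cassandra code; the class name, the even-split
rule, and the exact boundaries are illustrative assumptions -- any
contiguous split of the major range works):

import java.util.ArrayList;
import java.util.List;

public class DiskRanges {
    // Split the inclusive token range [start, end] into `disks`
    // contiguous sub-ranges; earlier disks absorb the remainder when
    // the range does not divide evenly. (The boundaries in the table
    // above, e.g. 0-2 / 3-4 / 5-7 / 8-10, are just one valid split.)
    static List<long[]> split(long start, long end, int disks) {
        List<long[]> out = new ArrayList<>();
        long total = end - start + 1;
        long cursor = start;
        for (int d = 0; d < disks; d++) {
            long size = total / disks + (d < total % disks ? 1 : 0);
            out.add(new long[]{cursor, cursor + size - 1});
            cursor += size;
        }
        return out;
    }

    public static void main(String[] args) {
        long[][] majorRanges = {{0, 10}, {11, 20}, {21, 30}};
        for (int n = 0; n < majorRanges.length; n++) {
            StringBuilder sb = new StringBuilder("Node" + (n + 1) + ":");
            List<long[]> subs = split(majorRanges[n][0], majorRanges[n][1], 4);
            for (int d = 0; d < subs.size(); d++)
                sb.append(" disk").append(d + 1).append(' ')
                  .append(subs.get(d)[0]).append('-').append(subs.get(d)[1]);
            System.out.println(sb);
        }
    }
}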
Adding a 4th node is easy. If you are at the data center, just take
disk 4 out of each node and place it in the new server :) Software-wise
it is the same deal: each node streams off only disk 4 to the new node.
At that point disk 4 is idle on every machine, and each machine can
rebalance its own data across its 4 disks. (A toy simulation of this
handoff is sketched at the bottom of this mail.)

> However, do note that making the vnode ~size of a disk (and only
> having 4-8 per machine) would make any non-hot-swap rebuilds slower.
> To get the fast distributed rebuilds, you need to have at least as
> many vnodes per node as you have nodes in the cluster. And you would
> still need the distributed rebuilds to deal with disk failure.
>
>> 2) Vnodes should be able to be hot-plugged. My normal Cassandra
>> chassis would be a 2U with 6 drive bays. Imagine I have 10 nodes.
>> Now if my chassis dies, I should be able to take the disks out and
>> physically plug them into another chassis. Then in Cassandra I
>> should be able to run a command like:
>> nodetool attach '/mnt/disk6'. disk6 should contain all data and its
>> vnode information.
>>
>> Now this would be awesome for upgrades/migrations/etc.
>
> You know, you're not the first person I've spoken to who has asked
> for this! I do wonder whether it is optimising for the right thing
> though - in my experience, disks fail more often than machines.
>
> Thanks
>
> Tom
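For concreteness, here is the disk-4 handoff as a toy Java simulation
(assumed names only, not Cassandra internals). It shows that the only
data crossing the network -- or the only disks leaving the chassis --
are the disk-4 sub-ranges, after which each old node respreads its
remaining contiguous span over all 4 local disks:

import java.util.ArrayList;
import java.util.List;

public class DiskHandoff {
    static final int DISKS = 4;

    public static void main(String[] args) {
        // Per-disk sub-ranges for the 3-node example above.
        long[][][] nodes = {
            {{0, 2},   {3, 4},   {5, 7},   {8, 10}},
            {{11, 12}, {13, 14}, {15, 17}, {18, 20}},
            {{21, 22}, {23, 24}, {25, 27}, {28, 30}},
        };

        // Step 1: the new (4th) node takes over exactly the sub-range
        // held on each existing node's 4th disk.
        List<long[]> newNode = new ArrayList<>();
        for (long[][] node : nodes) newNode.add(node[DISKS - 1]);
        System.out.println("new node owns: " + fmt(newNode));

        // Step 2: each old node keeps a contiguous span (disks 1-3)
        // and respreads it over all 4 local disks -- purely local I/O.
        for (long[][] node : nodes) {
            long start = node[0][0], end = node[DISKS - 2][1];
            System.out.println("node keeping " + start + "-" + end
                + " respreads as " + fmt(splitEven(start, end, DISKS)));
        }
    }

    // Same even-split rule as the earlier sketch.
    static List<long[]> splitEven(long start, long end, int disks) {
        List<long[]> out = new ArrayList<>();
        long total = end - start + 1, cursor = start;
        for (int d = 0; d < disks; d++) {
            long size = total / disks + (d < total % disks ? 1 : 0);
            out.add(new long[]{cursor, cursor + size - 1});
            cursor += size;
        }
        return out;
    }

    static String fmt(List<long[]> ranges) {
        StringBuilder sb = new StringBuilder();
        for (long[] r : ranges) sb.append(r[0]).append('-').append(r[1]).append(' ');
        return sb.toString().trim();
    }
}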