Well, it is kind of Riak-specific. If an implementation treated DELETEs like 
PUTs (tombstones w/ vector clocks for ordering), then this would not be an 
issue, right? When no primary nodes are down, the tombstones can be physically 
deleted on the backend. A logical delete could never reappear if that were how 
it worked.

Is this essentially what is on the current master branch (not yet released)?
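
For illustration, here is a rough sketch of what treating a delete as a 
tombstone PUT could look like at the application level. This is Python against 
Riak's HTTP interface; the host, bucket, key and value format are made up, and 
the physical delete step is meant to run only once you know all primary nodes 
are reachable again:

    import requests

    RIAK = "http://127.0.0.1:8098/riak"   # placeholder host/port

    def logical_delete(bucket, key):
        """Replace the value with a tombstone instead of issuing a DELETE."""
        key_url = "%s/%s/%s" % (RIAK, bucket, key)
        resp = requests.get(key_url)
        headers = {"Content-Type": "application/json"}
        vclock = resp.headers.get("X-Riak-Vclock")
        if vclock:
            # Carry the vector clock so the tombstone orders after the old value.
            headers["X-Riak-Vclock"] = vclock
        requests.put(key_url, headers=headers, data='{"deleted": true}')

    def physical_delete(bucket, key):
        """Reclaim the space; only to be run while no primary node is down."""
        requests.delete("%s/%s/%s" % (RIAK, bucket, key))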

On Thursday, June 16, 2011 at 8:22 AM, Nico Meyer wrote:

> The problem with unreachable nodes still remains, since you don't know 
> how long they will be gone. The only 'safe' minimum time to keep deleted 
> values is forever. This can be easily emulated in the application layer 
> by using a special value (or by using Riak metadata, for example).
> So it is essentially a trade-off, like most things. If you are sure that no 
> node will ever be down for more than 24 hours, your solution would work.
> 
> If it is really essential for an application that deleted keys never 
> reappear, you should just store this information explicitly (that 
> way you also know when the key was deleted, by the way). If not, then one 
> can live with the current behaviour, which is much simpler implementation-wise.
> 
> I would just separate the two issues of logically deleting and 
> physically deleting (which is just an operational issue, as opposed to an 
> issue for your application design). The latter could be handled by the 
> storage backend. Bitcask already has a key expiration feature. If it 
> were fixed so that expired keys are actually counted towards the 
> triggering of merges, and the TTL could be set per key, you would be 
> good to go ;-).
> 
> Btw, this whole issue is not really Riak-specific. It is essentially a 
> consequence of eventual consistency, where you have to make a trade-off 
> between the amount of bookkeeping information you want to store and the 
> maximum amount of time (or number of updates) any part of the system can 
> diverge from the rest of the system before you get undesired results.
> 
> Cheers,
> Nico
> 
> On 16.06.2011 16:50, Kresten Krab Thorup wrote:
> > ...when doing a delete, Riak actually stores a "deleted" record, but then 
> > it deletes it for real too eagerly after that. There should be a 
> > configurable "zombie time" between requesting a delete and the "deleted 
> > record" being deleted for real, so that the deleted record's vector clock 
> > will show that the delete is more recent than the other value(s) in case 
> > those are later reconciled. The current infrastructure just doesn't have a 
> > good place to "enqueue" such a "delete this for real in 24 hours"-ish 
> > request.
> > 
> > Also, the master branch now has support for specifying a vector clock with 
> > a delete (in the 0.14.x releases you can instead do a PUT with 
> > X-Riak-Deleted=true, a proper vector clock, and an empty content body). 
> > That's better (more consistent), but not a real fix.
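> >
> > A rough sketch of that 0.14.x workaround over the HTTP interface (Python; 
> > the URL is a placeholder):
> >
> >     import requests
> >
> >     url = "http://127.0.0.1:8098/riak/mybucket/mykey"   # placeholder URL
> >
> >     # Fetch the current value so we have a vector clock to descend from.
> >     resp = requests.get(url)
> >     vclock = resp.headers["X-Riak-Vclock"]
> >
> >     # PUT an empty body marked as deleted, carrying the vector clock, so
> >     # the tombstone is causally newer than the value(s) it replaces.
> >     requests.put(url,
> >                  headers={"X-Riak-Vclock": vclock,
> >                           "X-Riak-Deleted": "true",
> >                           "Content-Type": "application/octet-stream"},
> >                  data=b"")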
> > 
> > Kresten
> > 
> > On 16/06/2011, at 11.58, "Nico Meyer" <nico.me...@adition.com> wrote:
> > 
> > Hello David,
> > 
> > This behaviour is quite expected if you think about how Riak works.
> > Assuming you use the default replication factor of n=3, each key is stored 
> > on all three of your nodes. If you delete a key while one node (let's call 
> > it A) is down, the key is deleted from the two nodes that are still up 
> > (let's call them B and C) and remains on the downed node A.
> > Once node A is up again, the situation is indistinguishable from B and C 
> > having had a hard drive crash and lost all their data, in that A has the key 
> > and B and C know nothing about it.
> > 
> > If you do a GET of the deleted key at this point, the result depends on the 
> > r-value that you choose. For r>1 you will get a not_found on the first GET. 
> > For r=1 you might get the data or a not_found, depending on which two nodes 
> > answer first (see https://issues.basho.com/show_bug.cgi?id=992 for an 
> > explanation of basic quorum). Also, at that point read repair will kick in 
> > and re-replicate the key to all nodes, so subsequent GETs will always return 
> > the original datum.
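> >
> > To see the difference, you can pass r explicitly on the GET; a quick 
> > sketch in Python over the HTTP interface (the URL is a placeholder):
> >
> >     import requests
> >
> >     url = "http://127.0.0.1:8098/riak/mybucket/mykey"   # placeholder URL
> >
> >     # r=1: whatever answers first wins, so right after node A rejoins you
> >     # may get the old value back (200) or a not_found (404).
> >     first = requests.get(url, params={"r": 1})
> >
> >     # r=2: two vnodes must agree, so the first GET after A rejoins returns
> >     # a not_found; read repair then re-replicates the key to all nodes and
> >     # subsequent GETs return the data again.
> >     quorum = requests.get(url, params={"r": 2})
> >
> >     print(first.status_code, quorum.status_code)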
> > 
> > Listing keys, on the other hand, does not use a quorum but just does a set 
> > union of all keys on all the nodes in your cluster. Essentially it is 
> > equivalent to r=1 without basic quorum. The same is true for map/reduce 
> > queries, to my knowledge.
> > 
> > The essential problem is that a real physical delete is indistinguishable 
> > from data loss (or never having had the data in the first place), while 
> > those two things are logically different.
> > If you want to be sure that a key is deleted along with all its replicas, 
> > you must delete it with a write quorum setting of w=n. Also, you need to 
> > tell Riak not to count fallback vnodes toward your write quorum. This 
> > feature is quite new and I believe only available in the head revision. 
> > Also, I forgot the name of the parameter and don't know if it is even 
> > applicable to DELETEs.
> > Anyhow, if you do all this, your DELETEs will simply fail if any node that 
> > has a copy of the key is down (so in your case, if any node is down).
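> >
> > A sketch of what such a strict delete could look like over the HTTP 
> > interface (Python; the URL is a placeholder, and this only shows the 
> > quorum part, not the fallback-vnode parameter whose name I forgot):
> >
> >     import requests
> >
> >     url = "http://127.0.0.1:8098/riak/mybucket/mykey"   # placeholder URL
> >
> >     # Require all n=3 replicas to acknowledge the delete. Combined with not
> >     # counting fallback vnodes, this fails when one of the three nodes is
> >     # unreachable, instead of silently leaving a copy on the downed node.
> >     resp = requests.delete(url, params={"rw": 3})
> >
> >     if resp.status_code >= 400:
> >         print("delete not acknowledged by all replicas:", resp.status_code)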
> > 
> > If you only want to logically delete, and don't care about freeing the disk 
> > space and RAM used by the key, you should use a special value which is 
> > interpreted by your application as a not_found. That way you also get 
> > proper conflict resolution between DELETEs and PUTs (say, one client 
> > deletes a key while another one updates it).
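> >
> > A minimal sketch of such an application-level tombstone (Python over the 
> > HTTP interface; the URL and the shape of the special value are of course 
> > up to your application, and vector clock handling is left out for brevity):
> >
> >     import json, time
> >     import requests
> >
> >     url = "http://127.0.0.1:8098/riak/mybucket/mykey"   # placeholder URL
> >
> >     def delete(url):
> >         # Write the special value instead of issuing a real DELETE, and
> >         # record when it happened so a concurrent PUT can be reconciled.
> >         body = {"__deleted__": True, "ts": time.time()}
> >         requests.put(url, headers={"Content-Type": "application/json"},
> >                      data=json.dumps(body))
> >
> >     def get(url):
> >         resp = requests.get(url)
> >         if resp.status_code == 404:
> >             return None
> >         value = resp.json()
> >         # The application treats the special value as a not_found.
> >         return None if value.get("__deleted__") else value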
> > 
> > Cheers,
> > Nico
> > 
> > On 16.06.2011 00:55, David Mitchell wrote:
> > Erlang: R13B04
> > Riak: 0.14.2
> > 
> > I have a three-node cluster, and while one node was down, I deleted every 
> > key in a certain bucket. Then I started the node that was down, and it 
> > joined the cluster.
> > 
> > Now, when I do a listing of the keys in this bucket, I get the entire 
> > list. I can also get the values from the bucket. However, when I try to 
> > delete the keys, they are not deleted.
> > 
> > Can anyone help me get the nodes back in a consistent state? I have tried 
> > restarting the nodes.
> > 
> > David

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
