Well, it is kind of Riak specific. If an implementation treated DELETEs like PUTs (tombstones with vector clocks for ordering), then this would not be an issue, right? When no primary nodes are down, the tombstones can be physically deleted on the backend. A logical delete could never reappear if that were how it worked (rough client-side sketch of the same idea below).
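To make that concrete, here is roughly what the client-side equivalent looks like today via the HTTP interface. This is a minimal sketch only, assuming a node on localhost:8098; the TOMBSTONE sentinel and the function names are made-up application conventions, not Riak features.

import requests

RIAK = "http://localhost:8098"      # assumed local node; adjust as needed
TOMBSTONE = b"__deleted__"          # application-defined sentinel, not a Riak feature

def logical_delete(bucket, key):
    # Fetch the current object so its vector clock can be reused.
    resp = requests.get("%s/riak/%s/%s" % (RIAK, bucket, key))
    if resp.status_code == 404:
        return  # nothing to delete here (or this replica never had it)
    headers = {"Content-Type": "application/octet-stream"}
    vclock = resp.headers.get("X-Riak-Vclock")
    if vclock:
        headers["X-Riak-Vclock"] = vclock
    # Overwrite with the sentinel instead of issuing a DELETE, so the
    # "delete" is just another version ordered by the vector clock.
    requests.put("%s/riak/%s/%s" % (RIAK, bucket, key),
                 data=TOMBSTONE, headers=headers)

def fetch(bucket, key):
    # Readers treat the sentinel as not-found.
    resp = requests.get("%s/riak/%s/%s" % (RIAK, bucket, key))
    if resp.status_code == 404 or resp.content == TOMBSTONE:
        return None
    return resp.content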
Is this essentially what is on the current master branch (not yet released)?

On Thursday, June 16, 2011 at 8:22 AM, Nico Meyer wrote:

> The problem with unreachable nodes still remains, since you don't know
> how long they will be gone. The only 'safe' minimum time to keep deleted
> values is forever. This can be easily emulated in the application layer
> by using a special value (or use Riak metadata, for example).
> So it is essentially a trade-off, like most things. If you are sure that no
> node will ever be down for more than 24 hours, your solution would work.
>
> If it is really essential for an application that deleted keys don't
> ever reappear, you should just store this information explicitly (that
> way you also know when the key was deleted, btw). If not, then one can
> live with the current behaviour, which is much simpler implementation-wise.
>
> I would just separate the two issues of logically deleting and
> physically deleting (which is just an operational issue as opposed to an
> issue for your application design). The latter could be handled by the
> storage backend. Bitcask already has a key expiration feature. If it
> were fixed, so that expired keys are actually counted towards the
> triggering of merges, and the TTL could be set per key, you would be
> good to go ;-).
>
> Btw, this whole issue is not really Riak specific. It is essentially a
> consequence of eventual consistency, where you have to make a trade-off
> between the amount of bookkeeping information you want to store and the
> maximum amount of time (or number of updates) any part of the system can
> diverge from the rest of the system before you get undesired results.
>
> Cheers,
> Nico
>
> On 16.06.2011 16:50, Kresten Krab Thorup wrote:
> > ...when doing a delete, Riak actually stores a "deleted" record, but then
> > it is too eagerly deleting it for real after that. There should be a
> > configurable "zombie time" between requesting a delete and the "deleted
> > record" being deleted for real, so that the deleted record's vector clock
> > will show that the delete is more recent than the other value(s) in case
> > those are later reconciled. The current infrastructure just doesn't have a
> > good place to "enqueue" such a "delete this for real in 24 hours"-ish
> > request.
> >
> > Also, the master branch now has support for specifying a vector clock with
> > a delete (in 14.x releases you can instead do a PUT w/ X-Riak-Deleted=true
> > and a proper vector clock, and an empty content). That's better (more
> > consistent), but not a real fix.
> >
> > Kresten
> >
> > On 16/06/2011, at 11.58, "Nico Meyer" <nico.me...@adition.com> wrote:
> >
> > Hello David,
> >
> > this behaviour is quite expected if you think about how Riak works.
> > Assuming you use the default replication factor of n=3, each key is stored
> > on all of your three nodes. If you delete a key while one node (let's call
> > it A) is down, the key is deleted from the two nodes that are still up
> > (let's call them B and C), and remains on the downed node A.
> > Once node A is up again, the situation is indistinguishable from B and C
> > having had a hard drive crash and losing all their data, in that A has the
> > key and B and C know nothing about it.
> >
> > If you do a GET of the deleted key at this point, the result depends on the
> > r-value that you choose. For r>1 you will get a not_found on the first get.
> > For r=1 you might get the data or a not_found, depending on which two nodes
> > answer first (see https://issues.basho.com/show_bug.cgi?id=992 about basic
> > quorum for an explanation). Also, at that point read repair will kick in
> > and re-replicate the key to all nodes, so subsequent GETs will always
> > return the original datum.
> >
> > Listing keys, on the other hand, does not use a quorum but just does a set
> > union of all keys of all the nodes in your cluster. Essentially it is
> > equivalent to r=1 without basic quorum. The same is true for map/reduce
> > queries, to my knowledge.
> >
> > The essential problem is that a real physical delete is indistinguishable
> > from data loss (or never having had the data in the first place), while
> > those two things are logically different.
> > If you want to be sure that a key is deleted with all its replicas, you
> > must delete it with a write quorum setting of w=n. Also you need to tell
> > Riak not to count fallback vnodes toward your write quorum. This feature is
> > quite new and I believe only available in the head revision. Also I forgot
> > the name of the parameter and don't know if it is even applicable for
> > DELETEs. Anyhow, if you do all this, your DELETEs will simply fail if any
> > of the nodes that has a copy of the key is down (so in your case, if any
> > node is down).
> >
> > If you only want to logically delete, and don't care about freeing the disk
> > space and RAM that is used by the key, you should use a special value,
> > which is interpreted by your application as a not found. That way you also
> > get proper conflict resolution between DELETEs and PUTs (say one client
> > deletes a key while another one updates it).
> >
> > Cheers,
> > Nico
> >
> > On 16.06.2011 00:55, David Mitchell wrote:
> > Erlang: R13B04
> > Riak: 0.14.2
> >
> > I have a three node cluster, and while one node was down, I deleted every
> > key in a certain bucket. Then, I started the node that was down, and it
> > joined the cluster.
> >
> > Now, when I do a listing of the keys in this bucket, I get the entire
> > list. I can also get the values in the bucket. However, when I try to
> > delete the keys, the keys are not deleted.
> >
> > Can anyone help me get the nodes back in a consistent state? I have tried
> > restarting the nodes.
> >
> > David
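(Appending a small illustration below the quoted thread, in case it helps anyone following along: if I remember correctly, the r-value and quorum behaviour Nico describes maps onto query parameters of the HTTP interface. This is only a rough Python sketch; the host, bucket and key are placeholders, and the r/rw parameter names should be checked against the docs for the Riak release you are running.)

import requests

RIAK = "http://localhost:8098"   # placeholder node address

def get_with_quorum(bucket, key, r):
    # r=1 can return stale data or a not_found depending on which replicas
    # answer first; a larger r makes more replicas agree before answering.
    return requests.get("%s/riak/%s/%s" % (RIAK, bucket, key), params={"r": r})

def delete_all_replicas(bucket, key, n=3):
    # Asking for rw=n means the delete should fail, rather than silently
    # leave a replica behind, if a node holding the key is unreachable
    # (subject to the fallback-vnode caveat Nico mentions).
    resp = requests.delete("%s/riak/%s/%s" % (RIAK, bucket, key),
                           params={"rw": n})
    return resp.status_code in (204, 404)

None of this changes the logical-delete ordering problem, of course; for that, the tombstone value above (or the master-branch vclock-aware delete Kresten mentions) is still the way to go.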
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com