>> Are you using the durability term strictly in the area of transactions,
>> or in the sense that a successful write survives a system crash?
>
> Durability has all sorts of interesting characteristics. In general, it
> means that after something bad, committed transactions persist. The
> questions are around how bad "bad" is.
>
> At the minimum, durability requires that writes be written to persistent
> storage. A more rigorous approach requires that a write be safe from any
> single point of failure. Note that simply forcing writes to disk, even with
> a non-buffered log, provides no immunity from a disk crash. That plus RAID,
> probably, depending on whether the RAID implementation can be persuaded to
> write a block on at least two devices before reporting success.
>
> Another approach is to ensure that writes are transmitted to multiple
> systems. This opens more options. Does it require confirmation of receipt?
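To make the minimum bar above concrete: "written to persistent storage"
means the log record is forced to the device before the commit is
acknowledged. A minimal sketch in Java; the class name and record format
are invented for illustration:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.file.Path;
    import java.nio.file.StandardOpenOption;

    // Sketch of the minimum durability bar: append the commit record,
    // then force it to the device before acknowledging the commit.
    public class CommitLog {
        private final FileChannel channel;

        public CommitLog(Path path) throws IOException {
            channel = FileChannel.open(path, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        }

        public void commit(byte[] record) throws IOException {
            channel.write(ByteBuffer.wrap(record));
            // Without force(true) the record may still sit in the OS page
            // cache when we report "committed"; with it, data and file
            // metadata are flushed. Even so, a single disk crash can still
            // lose everything, hence the point above about RAID or a
            // second copy on another device.
            channel.force(true);
        }
    }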
Interesting question. I would say it depends on "tunable" (I'm sorry *g*)
consistency for writes in a distributed system. I must admit I've been in
the Cassandra "business" for more than 2 years now, but it's interesting
to hear thoughts on that in general. ;-)

Imagine a replication factor (RF) of 3, which basically means each
row/replica lives on 3 physically/regionally separated nodes. From a
client perspective, a write request at a consistency level of e.g. ONE
will succeed as soon as a single node acknowledges a successful write,
where "successful" means the data change has at least been written into a
commit log persisted on disk. A consistency level of ALL means all
nodes/replicas have to acknowledge success. And there are QUORUM
requests, defined as (RF / 2 + 1), in our example = 2, thus 2 nodes need
to respond. So while <= RF nodes need to respond for a write to succeed
from the client perspective, depending on the chosen consistency level,
the write is forwarded asynchronously to all remaining replicas/nodes,
not by the client but by the coordinator for that single request. (A
small driver-level sketch of these consistency levels follows at the end
of this message.)

> Is each recipient itself required to write the data to persistent storage?

From a client perspective, it depends on the requested consistency level.

> Or if the distributed system has a mechanism to recover and flood all systems
> with messages sent from a crashed system, does that count?

Following the write example above, a live node may record replica
transmissions missed by other nodes for a finite time period; these are
replayed within that period if the other node(s) come alive again.

> When you introduce regions (or coteries), things get even more complicated.
> To survive a network partition, a commit has to be broadcast to at least one
> node in every region/coterie.

Right. Replica placement must be data-center- and rack-aware. E.g. in an
AWS-based cloud deployment with 3 availability zones (AZs) per region and
a replication factor of 3, the ideal replica placement is 1 per
availability zone, to fulfill QUORUM requests even if an entire AZ is
down or unreachable for some reason. The next step would be to replicate
across several data centers (in AWS terms, regions, e.g. from us-east to
eu-west), but network traffic might get pricey, and replica latency is
another topic here.

> NuoDB has basically all of the above (though I'm not sure which are
> completely documented and shipping). You can configure how many storage
> managers have to receive an update and whether those storage managers have
> to flush the updates to a log before transmitting the acknowledgement.
>
> Interbase V1 reported commit after all updates had been written to disk. As
> an option, a journal server could be configured, in which case all data
> would be written twice (on different devices) before a transaction could be
> considered committed.
>
> Borland didn't understand the concept of "single point of failure" when
> they tried to turn the journal code into a write-ahead log and ended up
> with neither. This was part of the V5 (?) disaster.
>
>> If the latter, Cassandra has durable writes by first persisting a write
>> operation into a commit log on disk before acknowledging the write
>> operation as successful to the client.
>
> If you want immunity from a single point of failure, this doesn't in
> general hack it.
>
> For open source, probably good enough -- I don't know of anyone who does
> better. For a commercial system, probably a non-starter.
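To make the consistency-level mechanics described at the top of this
message concrete, here is a minimal sketch using the DataStax Java driver
(the pre-4.0 com.datastax.driver.core API); the contact point, keyspace
and table are invented:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;

    public class QuorumWrite {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder()
                     .addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("demo")) {
                SimpleStatement write = new SimpleStatement(
                    "INSERT INTO users (id, name) VALUES (42, 'thomas')");
                // With RF = 3, QUORUM = RF / 2 + 1 = 2: two replicas must
                // acknowledge (i.e. hit their commit log) before the client
                // sees success; ONE would need just one, ALL needs all 3.
                write.setConsistencyLevel(ConsistencyLevel.QUORUM);
                session.execute(write);
            }
        }
    }

ONE trades safety for latency, ALL trades latency for safety; QUORUM sits
in between and still tolerates the loss of any single replica.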
Right, for a single-node environment, but who wants to run a distributed
environment on a single node? And even in a distributed environment, the
architecture must not introduce a single point of failure, e.g. through
some sort of master concept. Every node must be an equal peer. (The
quorum arithmetic sketch at the end of this message shows why quorums
tolerate the loss of any single node.)

Thanks,
Thomas

>>> It's completely legitimate to have settable durability
>>> (NuoDB has more options than you can shake a stick at), but at the end
>>> of the day, a transaction is either committed or it must appear that it
>>> never existed.
>>>
>>> For me, consistency is dirt simple: A transaction sees a stable view of
>>> the database but can't update a version of a record it could not see,
>>> plus enforcement of any declared consistency constraints. A corollary
>>> is that any and all reduced consistency modes are for the birds and
>>> exist solely because record-locking database systems cannot perform
>>> without them.
>>>
>>> I invented MVCC to bring transactions to the masses; to make
>>> transactions easier to use than not. All of the other crap is bandaids
>>> on systems that essentially don't work.
>>>
>>> I learned at DEC that most options are the product of bad design where
>>> the designer didn't have a right answer so gave the users a choice
>>> between two bad answers. Bah!
>>
>> Why does NuoDB have more durability options than you can shake a stick
>> at, then? ;-)
>
> Two reasons. One is a legitimate tradeoff between commit latency and
> paranoia. The other is that many organizations with large bank accounts
> require that updates be written twice before a commit is reported. This
> isn't theoretically or practically necessary, but that's their policy.
> Easier to implement than argue against.
>
>> Na, seriously, thanks for your insights.
>>
>> Regards,
>> Thomas
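As a footnote to the quorum discussion above: the reason a QUORUM write
plus a QUORUM read tolerates the loss of any single node is plain
counting; read and write sets of size RF / 2 + 1 must overlap in at least
one replica. A tiny sketch of the arithmetic:

    public class QuorumMath {
        public static void main(String[] args) {
            int rf = 3;                  // replication factor
            int quorum = rf / 2 + 1;     // integer division: 2 for RF = 3
            int writeAcks = quorum;      // replicas acking the write
            int readResponses = quorum;  // replicas answering the read
            // Overlap is guaranteed whenever writes + reads > RF, so at
            // least one replica in every quorum read has the latest write.
            System.out.println("quorum = " + quorum + ", overlap guaranteed = "
                + (writeAcks + readResponses > rf));
        }
    }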