>> Are you using the durability term strictly in the area of transactions
>> or in a sense that a successful write survives a system crash?
>
> Durability has all sorts of interesting characteristics.  In general, it 
> means that after something bad, committed transactions persist.  The 
> questions are around how bad bad is.
>
> At the minimum, durability requires that writes be written to persistent 
> storage.  A more rigorous approach requires that a write be safe from any 
> single point of failure.  Note that simply forcing writes to disk, even with 
> a non-buffered log, provides no immunity from a disk crash.  That plus RAID 
> probably does, depending on whether the RAID implementation can be persuaded 
> to write a block on at least two devices before reporting success.
>
> Another approach is to ensure that writes are transmitted to multiple 
> systems.  This opens more options.  Does it require confirmation of receipt?

Interesting question. I would say it depends on "tunable" (I'm sorry 
*g*) consistency for writes in a distributed system. I must admit I've 
been in the Cassandra "business" for more than 2 years now, but it's 
interesting to hear thoughts on that in general. ;-)

Imagine a replication factor (RF) of 3, which basically means each 
row/replica lives on 3 physically/regionally separated nodes.

A write request, from a client perspective, at a consistency level of 
e.g. ONE will succeed as soon as a single node acknowledges a 
successful write, where successful means the data change has at least 
been written to a "commit log" persisted on disk. A consistency level 
of ALL means all nodes/replicas have to acknowledge success. And there 
are QUORUM requests, defined (using integer division) as:

floor(RF / 2) + 1

which in our example is 2, so 2 nodes need to respond.
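
To make the integer division explicit, a tiny Python sketch (my own 
illustration):

    # QUORUM for a given replication factor: integer division, plus one
    def quorum(rf):
        return rf // 2 + 1

    assert quorum(3) == 2   # the RF=3 example above
    assert quorum(5) == 3   # RF=5 tolerates 2 unreachable replicas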

While at most RF nodes need to respond for a successful write from a 
client perspective, depending on the chosen consistency level, the 
write is asynchronously forwarded to all replicas/nodes, not by the 
client, but by the coordinator for that single request.
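
As a minimal sketch of how a client chooses the consistency level per 
request, using the DataStax Python driver (the keyspace and table names 
here are made up):

    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    # Any contact node may end up coordinating a given request.
    session = Cluster(['127.0.0.1']).connect('demo')

    # The client asks for QUORUM: with RF=3 the coordinator reports
    # success once 2 replicas have acknowledged, and still forwards
    # the write to the third replica asynchronously.
    stmt = SimpleStatement(
        "INSERT INTO users (id, name) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM)
    session.execute(stmt, (1, 'thomas'))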

> Is each recipient itself required to write the data to persistent storage?

From a client perspective, it depends on the requested consistency level.
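
Worth adding: whether a single replica also fsyncs its commit-log 
append to disk before acknowledging is a node-side setting rather than 
a per-request one. Roughly, the relevant knobs in cassandra.yaml (2.x 
naming, defaults from memory):

    # "periodic" acknowledges before fsync and syncs every N ms (default);
    # "batch" holds the acknowledgement until the commit log is fsynced.
    commitlog_sync: periodic
    commitlog_sync_period_in_ms: 10000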

> Or if the distributed system has a mechanism to recover and flood all systems 
> with messages sent from a crashed system, does that count?

From the write example above, a live node may record missed replica 
transmissions to other nodes (so-called hints) for a finite time 
period; these hints get replayed if the other node(s) come back up 
within that period.
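
That "hinted handoff" window is tunable per node in cassandra.yaml, 
roughly (again 2.x naming):

    hinted_handoff_enabled: true
    # hints are kept for at most this window (3 hours here); a node
    # that stays down longer has to be repaired instead of replayed to
    max_hint_window_in_ms: 10800000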

>
> When you introduce regions (or coteries), things get even more complicated.  
> To survive a network partition, a commit has to be broadcast to at least one 
> node in every region/coterie.

Right. "Replica placement" must be data-center/rack-aware. E.g. in a 
AWS-based cloud deployment with 3 availability zones (AVZ) per region 
and a replication factor of 3, the ideal replica placement would be 1 
per availability zone, to fulfill QUORUM requests even if an entire AVZ 
is down/not-reachable for some reason. The next step would be to 
replicate across several data-centers or in AWS terms, regions, e.g. 
from us-east to eu-west, but network traffic might get pricy and replica 
latency is another topic here.
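
A minimal sketch of such a placement, again via the Python driver (the 
keyspace name is made up, and it assumes the nodes run an EC2-aware 
snitch such as Ec2MultiRegionSnitch, which maps regions to data centers 
and availability zones to racks):

    from cassandra.cluster import Cluster

    session = Cluster(['127.0.0.1']).connect()
    # NetworkTopologyStrategy counts replicas per data center; with an
    # EC2 snitch the 3 replicas per region are spread across the 3 AZs
    # (racks), so losing a whole AZ still leaves a QUORUM of 2.
    session.execute("""
        CREATE KEYSPACE demo WITH replication = {
            'class': 'NetworkTopologyStrategy',
            'us-east': 3,
            'eu-west': 3}
    """)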

> NuoDB has basically all of the above (though I'm not sure which are 
> completely documented and shipping).  You can configure how many storage 
> managers have to receive an update and whether those storage managers have 
> to flush the updates to a log before transmitting the acknowledgement.
>
> Interbase V1 reported a commit after all updates had been written to disk.  As 
> an option, a journal server could be configured, in which case all data would 
> be written twice (on different devices) before a transaction could be 
> considered committed.
>
> Borland didn't understand the concept of "single point of failure" when they 
> tried to turn the journal code into a write-ahead log and ended up with 
> neither.  This was part of the V5 (?) disaster.
>
>>
>> If the latter, Cassandra has durable writes by first persisting a write
>> operation into a commit log on disk before acknowledging the write as
>> successful to the client.
>
> If you want immunity from a single point of failure, this doesn't in general 
> hack it.

> For open source, probably good enough -- I don't know of anyone who does 
> better.  For a commercial system, probably a non-starter.

Right, that's true for a single-node environment, but who wants to run 
a distributed environment on a single node?

And even in a distributed environment, there must not be a single 
point of failure caused by the architecture, e.g. some sort of master 
concept. Each node must be an equal peer.

Thanks,
Thomas



>
>>
>>
>>> It's completely legitimate to have settable durability
>>> (NuoDB has more options than you can shake a stick at), but at the end
>>> of the day, a transaction is either committed or it must appear that it
>>> never existed.
>>>
>>> For me, consistency is dirt simple: A transaction sees a stable view of
>>> the database but can't update a version of a record it could not see,
>>> plus enforcement of any declared consistency constraints.  A
>>> corollary is that any and all reduced consistency modes are for the
>>> birds and exist solely because record locking database systems cannot
>>> perform without them.
>>>
>>> I invented MVCC to bring transactions to the masses; to make
>>> transactions easier to use than not.  All of the other crap is band-aids
>>> on systems that essentially don't work.
>>>
>>> I learned at DEC that most options are the product of bad design where
>>> the designer didn't have a right answer so gave the users a choice
>>> between two bad answers.  Bah!
>>
>> Why does NuoDB have more durability options than you can shake a stick
>> at, then? ;-)
>
> Two reasons.  One is a legitimate tradeoff between commit latency and 
> paranoia.  The other is that many organizations with large bank accounts 
> require that updates be written twice before a commit is reported.  This 
> isn't theoretically or practically necessary, but that's their policy.  
> Easier to implement than argue against.
>
>
>>
>> Na, seriously, thanks for your insights.
>>
>> Regards,
>> Thomas
