Re: [HACKERS] quick review

Christopher Browne Sun, 24 Dec 2006 19:06:52 -0800

A long time ago, in a galaxy far, far away, [EMAIL PROTECTED] ("Dawid 
Kuroczko") wrote:
> On 12/24/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> On Mon, Dec 18, 2006 at 03:47:42AM +0100, Molle Bestefich wrote:
>>
>> [...]
>>
>> > Simply put, a tool with just a single button named "recover
>> > all the data that you can" is by far the best solution in so
>> > many cases.  Minimal fuss, minimal downtime, minimal money
>> > spent on recovery.  And perhaps there's even a good chance that
>> > any missing data could be entered back into the system manually.
>>
>> I think the point which has been made here was that the recovery tool
>> *is already there*: i.e. all that can be done as a "one-click" recovery
>> is done by the system at start-up. Beyond this no cookbook exists (and
>> thus no way to put it under a one-click procedure).
>>
>> So this one-click thing would be mainly something to cater for the
>> "needs" of marketing.
>
> Well, start-up recovery is great and reliable.  The only problem is that
> it won't help if you have some obscure hardware problem; then you really
> have a problem.  If you want to sleep well, you should know what to
> do when disaster happens.
>
> I really like the approach of the XFS filesystem, which ships with an
> fsck.xfs that is essentially equivalent to /bin/true.  They write in
> their white paper that they did so because journaling should recover
> from all failures.  Yet they also wrote that some time afterwards they
> learned that hardware corruption is not as unlikely as one might assume,
> so they provide the xfs_check and xfs_repair utilities.
>
> I think there should be a documented way to recover from obscure
> hardware failure, with even more detailed information on how this could
> result only from using crappy hardware...  And I don't think this should
> be a "one click" process -- some people might miss real (software)
> corruption, and this is the biggest drawback.  Perhaps the disaster
> recoverer should leave a detailed log which would be enough to
> detect software corruption even after the recovery [and users should
> be advised to send it in].


The trouble is that it is often *impossible* to recover from the
"obscure hardware failure."

If the failure is that a bunch of vital bits have been lost or
scribbled on, there may be NO way to recover from this.

And in practice, this seems to be a common form for "obscure hardware
failure" to take: the damage is simply irretrievable.

Historically, there have been two main sorts of corruption:

1.  Hardware corruptions, where the only recovery is to have some sort
of replica of the data, whether via near-hardware mechanisms (e.g.,
RAID) or more 'logical' mechanisms (e.g., replication systems).

2.  Software corruptions, where the answer is not to provide some
"recovery mechanism," but rather to FIX THE BUG that is leading to the
problem.  Once the bug is fixed, there is no more corruption (of this
sort).

Neither of these is amenable to a recovery mechanism such as you
describe.  There are really only two possibilities:

 a) The problem is one that the WAL recovery system can cope with, or

 b) There has been True Data Loss, and there is NO recovery system
    short of recovering from backup/replica.
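
To make case (b) concrete, here is a toy sketch in Python.  It has
nothing to do with how PostgreSQL actually stores pages; the function
names and the CRC-per-page scheme are invented purely for illustration.
The point it tries to show is that a checksum can tell you a page has
been scribbled on, but it cannot tell you what the bits used to be.
Only another intact copy can do that.

```python
import zlib
from typing import Optional


def page_crc(page: bytes) -> int:
    """Checksum recorded when the page was last written (hypothetical scheme)."""
    return zlib.crc32(page)


def recover_page(page: bytes, expected_crc: int,
                 replica: Optional[bytes]) -> bytes:
    """Return an intact copy of the page, or fail if none exists."""
    # The local copy still matches its checksum: nothing to recover.
    if page_crc(page) == expected_crc:
        return page
    # The local copy is damaged.  The checksum detects that, but it
    # carries no information for rebuilding the lost bits, so the only
    # option is a second copy (backup, RAID mirror, replica, ...).
    if replica is not None and page_crc(replica) == expected_crc:
        return replica
    # No intact copy anywhere: this is True Data Loss.
    raise RuntimeError("no intact copy of this page exists")


original = b"vital bits"
crc = page_crc(original)
damaged = b"vital b1ts"  # one byte scribbled on

# With a replica the page comes back; without one, recover_page raises.
restored = recover_page(damaged, crc, replica=original)
```

No amount of cleverness inside recover_page changes the last branch;
that is the whole argument.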
-- 
output = ("cbbrowne" "@" "acm.org")
http://cbbrowne.com/info/slony.html
"Here I am, brain the size of a planet, and they ask me to take you
down to the bridge.  Call that job satisfaction?  'Cos I don't."
-- Marvin the Paranoid Android
