Re: [zfs-discuss] ZFS, power failures, and UPSes (and ZFS recovery guide links)

Haudy Kazemi Wed, 01 Jul 2009 05:12:34 -0700

Ian Collins wrote:

David Magda wrote:
On Jun 30, 2009, at 14:08, Bob Friesenhahn wrote:
I have seen UPSs help quite a lot for short glitches lasting seconds, or a minute. Otherwise the outage is usually longer than the UPSs can stay up since the problem required human attention.
A standby generator is needed for any long outages.
Can't remember where I read the claim, but supposedly if power isn't restored within about ten minutes, then it will probably be out for a few hours. If this 'statistic' is true, it would mean that your UPS should last (say) fifteen minutes, and after that you really need a generator.
Or run your systems off DC and get as much backup as you have room (and budget!) for batteries. I once visited a central exchange with 48 hours of battery capacity...

The way Google handles UPSes is to have a small 12V battery integrated with each PC power supply. When the machine is on, the battery has its charge maintained. Not unlike a laptop in that it has a built-in battery backup, but using an inexpensive sealed lead acid battery instead of lithium ion. Here is info along with photos of the Google server internals:

http://news.cnet.com/8301-1001_3-10209580-92.html
http://willysr.blogspot.com/2009/04/googles-server-design.html

(IIRC there have been power supply UPSes since at least the late 1980s which had an internal battery. Either that or they were UPSes that fit inside the standard PC (AT) compatible desktop case, making the power protection system entirely internal to the computer. I think I saw these models one time while browsing late 1980s or early 1990s issues of PC Magazine that reviewed UPSes. They still exist... one company selling them is http://www.globtek.com/html/ups.html . A Google search for 'power supply built in UPS' would likely find more.)

I also did additional searches in the zfs-discuss archives and found a thread from mid-February, which led me to other threads. It looks like there are still scattered instances where ZFS has not recovered gracefully from power failures or other failures, where it became necessary to perform a manual transaction group (txg) rollback. Here is a consolidated list of links related to manual uberblock transaction group (txg) rollback and similar ZFS data recovery guides, including undeleting:


Section 1: Nathan Hand's guide and related thread
Nathan Hand's guide to invalidating uberblocks (Dec 2008 thread)
http://www.opensolaris.org/jive/thread.jspa?threadID=85794
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg22153.html


Section 2: Victor Latushkin's guide and related threads

Thread: zpool unimportable (corrupt zpool metadata??) but no zdb -l device problems (Oct 2008 to Feb 2009 thread)

http://www.opensolaris.org/jive/thread.jspa?threadID=76960
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg19839.html

Repair report: Re: Solved - a big THANKS to Victor Latushkin @ Sun / Moscow
http://www.opensolaris.org/jive/message.jspa?messageID=289537#289537

Some recovery discussion by Victor: "zdb -bv alone took several hours to walk the block tree"

http://www.opensolaris.org/jive/message.jspa?messageID=292991#292991

or http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/022365.html

or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg20095.html

Victor Latushkin's guide: "Thanks to COW nature of ZFS it was possible to successfully recover pool state which was only 5 seconds older than last unopenable one."

http://mail.opensolaris.org/pipermail/zfs-discuss/2008-October/022331.html
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg20061.html


Section 3: reliability debates, recovery tool planning, uberblock info

Thread: Availability: ZFS needs to handle disk removal / driver failure better (August 2008 thread)

http://www.opensolaris.org/jive/thread.jspa?threadID=70811
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg19057.html

Thread: ZFS: unreliable for professional usage? (Feb 2009 thread)
http://www.opensolaris.org/jive/thread.jspa?threadID=91426
or http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg23833.html

Richard Elling's post that "uberblocks are kept in an 128-entry circular queue which is 4x redundant with 2 copies each at the beginning and end of the vdev. Other metadata, by default, is 2x redundant and spatially diverse."

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg24145.html
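
As a rough illustration of the circular queue described in that post, here is a toy Python model. It assumes (my assumption, not something stated in the post) that the uberblock for each txg lands in slot "txg mod 128", so a ring always holds the most recent 128 txgs. The names Uberblock, UberblockRing and demo_ring are hypothetical, and none of this is the real on-disk layout.

```python
from dataclasses import dataclass
from typing import List, Optional

RING_SLOTS = 128      # entries per ring, per the quoted post
LABELS_PER_VDEV = 4   # 2 copies at the beginning + 2 at the end of the vdev


@dataclass
class Uberblock:
    txg: int


class UberblockRing:
    """Toy model of one copy of the uberblock ring (hypothetical, not ZFS code)."""

    def __init__(self) -> None:
        self.slots: List[Optional[Uberblock]] = [None] * RING_SLOTS

    def write(self, txg: int) -> int:
        # Assumption for illustration: slot index is txg modulo ring size,
        # so writing txg N overwrites the entry for txg N - 128.
        slot = txg % RING_SLOTS
        self.slots[slot] = Uberblock(txg)
        return slot

    def txgs(self) -> List[int]:
        # All txgs currently held in the ring, oldest first.
        return sorted(ub.txg for ub in self.slots if ub is not None)

    def newest(self) -> Optional[Uberblock]:
        # The active uberblock is modelled as the one with the highest txg.
        live = [ub for ub in self.slots if ub is not None]
        return max(live, key=lambda ub: ub.txg, default=None)


def demo_ring() -> List[UberblockRing]:
    # Write 200 consecutive txgs to all four copies of the ring.
    labels = [UberblockRing() for _ in range(LABELS_PER_VDEV)]
    for txg in range(1000, 1200):
        for ring in labels:
            ring.write(txg)
    return labels
```

In this model, after 200 writes each of the four copies still holds only the last 128 txgs, which is why a rollback can only reach back a limited number of transaction groups.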

Jeff Bonwick's post about Bug ID 6667683
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg23961.html

Bug ID 6667683: need a way to rollback to an uberblock from a previous txg

Description: If we are unable to open the pool based on the most recent uberblock then it might be useful to try an older txg uberblock as it might provide a better view of the world. Having a utility to reset the uberblock to a previous txg might provide a nice recovery mechanism.

http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6667683
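
The idea in that bug description can be sketched roughly as follows: try the newest uberblock first and, if the pool cannot be opened from it, fall back to the next older txg. This Python is only a sketch. pick_uberblock and the can_open callback are hypothetical stand-ins for whatever check a real recovery utility would perform, and nothing here reflects actual ZFS code.

```python
from typing import Callable, Iterable, Optional


def pick_uberblock(txgs: Iterable[int],
                   can_open: Callable[[int], bool]) -> Optional[int]:
    """Return the highest txg whose uberblock gives an openable pool.

    can_open is a hypothetical callback standing in for "try to open the
    pool from the uberblock of this txg"; it is not a real ZFS interface.
    """
    for txg in sorted(set(txgs), reverse=True):  # newest candidate first
        if can_open(txg):
            return txg
    return None  # no candidate uberblock worked


# Example: the two most recent txgs are damaged, so we roll back to 1197.
damaged = {1198, 1199}
chosen = pick_uberblock(range(1072, 1200), lambda txg: txg not in damaged)
```

Anything written in the skipped txgs is lost in such a rollback, which matches the "only 5 seconds older" trade-off mentioned in Victor's guide above.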

Uberblock information
http://blogs.sun.com/blogfinger/entry/zfs_and_the_uberblock
http://blogs.sun.com/blogfinger/entry/zfs_and_the_uberblock_part


Section 4: undeleting

Recovering removed file on zfs disk using a modified mdb and zdb (i.e. undelete)

http://mbruning.blogspot.com/2008/08/recovering-removed-file-on-zfs-disk.html

Re: [zfs-discuss] Forensic analysis [was: more ZFS recovery] (listed because forensic analysis tools often overlap with undeletion tools/data recovery tools)

http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg18557.html
http://opensolaris.org/os/project/forensics/ZFS-Forensics/


Thanks everyone for the input you've given so far.

-hk
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
