>>>>> "mp" == Mattias Pantzare <[EMAIL PROTECTED]> writes:

    >> This is a big one: ZFS can continue writing to an unavailable
    >> pool.  It doesn't always generate errors (I've seen it copy
    >> over 100MB before erroring), and if not spotted, this *will*
    >> cause data loss after you reboot.

    mp> This is not unique to zfs. If you need to know that your
    mp> writes have reached stable storage, you have to call fsync().

seconded.

How about this:

 * start the copy

 * pull the disk, without waiting for an error to be reported to the application

 * type 'lockfs -fa'.  Does lockfs either hang, or return an
   immediate error after you request it?
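
   Roughly, as an untested sketch (the file and pool names are made up):

-----8<-----
# start the copy in the background
cp /var/tmp/bigfile /tank/copy-test &
# ...now physically pull the disk backing 'tank', before any error surfaces...
lockfs -fa                       # flush and lock all mounted filesystems
echo "lockfs exit status: $?"    # does it hang here, or fail right away?
-----8<-----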

If so, I think it's ok, and within the unix tradition, to allow all
these writes.  It's just a more extreme version of that tradition,
which might not be an entirely bad compromise if ZFS can keep up this
behavior and actually retry the unreported failed writes when
confronted with FC, iSCSI, USB, and FireWire targets that bounce.  I'm
not sure whether it can do that yet, but architecturally I wouldn't
want to demand that it return failure to the app too soon, so long as
fsync() still behaves correctly w.r.t. power failures.


However, the other problems you report are things I've run into also.
'zpool status' should not be touching the disk at all.  So, we have:

 * 'zpool list' shows ONLINE several minutes after a drive is yanked.
   While 'zpool list' still shows ONLINE, 'zpool status' doesn't show
   anything at all because it hangs, so ONLINE seems too positive a
   report for the situation.  I'd suggest:

   + 'zpool list' should not borrow the ONLINE terminology from 'zpool
     status' if the list command means something different by the word
     ONLINE.  maybe SEEMS_TO_BE_AROUND_SOMEWHERE is more appropriate.

   + during this problem, 'zpool list' is available while 'zpool
     status' is not working.  Fine; maybe during a failure not all
     status tools will be available.  However, it would be nice if, at
     a minimum, some status tool capable of reporting ``pool X is
     failing'' were available.  In the absence of that, you may have
     to reboot the machine without ever knowing which pool's failure
     brought it down.

 * maybe sometimes certain types of status and statistics aren't
   available, but no status-reporting tools should ever be subject to
   blocking inside the kernel.  At worst they should refuse to give
   information, and return to a prompt, immediately.  I'm in the habit
   of typing 'zpool status &' during serious problems so I don't lose
   control of the console.

 * 'zpool status' is used when things are failing.  Cabling and driver
   state machines are among the failures from which a volume manager
   should protect us---that's why we say ``buy redundant controllers
   if possible.''

   In this scenario, a read is an intrusive act, because it could
   provoke a problem.  So even if 'zpool status' is only reading, not
   writing to disk nor to data structures inside the kernel, it is
   still not really a status tool.  It's an invasive
   poking/pinging/restarting/breaking tool.  Such tools should be
   segregated, and shouldn't substitute for the requirement to have
   true status tools that only read data structures kept in the
   kernel, not update kernel structures and not touch disks.  This
   would be as if 'ps' made an implicit call to rcapd, or activated
   some swapping thread, or something like that.  ``My machine is
   sluggish.  I wonder what's slowing it down.  ...'ps'...  oh, shit,
   now it's not responding at all, and I'll never know why.''

   There can be other tools, too, but I think LVM2 and SVM both have
   carefully non-invasive status tools, don't they?

   This principle should be followed everywhere.  For example,
   'iscsiadm list discovery-address' should simply list the discovery
   addresses.  It should not implicitly attempt to contact each
   discovery address in its list, while I wait.

-----8<-----
terabithia:/# time iscsiadm list discovery-address
Discovery Address: 10.100.100.135:3260
Discovery Address: 10.100.100.138:3260

real    0m45.935s
user    0m0.006s
sys     0m0.019s
terabithia:/# jobs
[1]+  Running                 zpool status &
terabithia:/# 
-----8<-----

   now, if you really want to see this scale, try the above again with 100 iSCSI
   targets and 20 pools.  A single 'iscsiadm list discovery-address'
   command, even if it's sort-of ``working'', can take hours to
   complete.

   This does not happen on Linux, where I configure through text files
   and inspect status through 'cat /proc/...'.
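
   (For md software RAID, for example, the status really is just a
   read of in-kernel state:)

-----8<-----
cat /proc/mdstat    # prints the kernel's md RAID state; no devices are probed or touched
-----8<-----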

In other words, it's not just that the information 'zpool status'
gives is inaccurate.  It's not just that some information is hidden
(like how sometimes a device listed as ONLINE will say ``no valid
replicas'' when you try to offline it, and sometimes it won't, and the
only way to tell the difference is to attempt to offline the
device---so trying to 'zpool offline' each device in turn is a way to
get some more indication of pool health than what 'zpool status' gives
on its own).  It's also that I don't trust 'zpool status' not to
affect the information it's supposed to be reporting.
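
(For reference, the offline-probing trick in that parenthetical looks
roughly like this; the pool and device names are made up, and it's
untested:)

-----8<-----
# try a temporary offline of each device in turn; the refusals are the extra information
for dev in c2t0d0 c2t1d0 c2t2d0; do
    if zpool offline -t tank $dev; then
        echo "$dev: offline accepted"    # the pool still has valid replicas without it
        zpool online tank $dev
    else
        echo "$dev: offline refused"     # e.g. "no valid replicas"
    fi
done
-----8<-----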
