On 26/11/2017 00:17, Warner Losh wrote:
> On Sat, Nov 25, 2017 at 10:40 AM, Andriy Gapon <[email protected]
> <mailto:[email protected]>> wrote:
> 
>     Before anything else, I would like to say that I got an impression
>     that we speak from so different angles that we either don't
>     understand each other's words or, even worse, misinterpret them.
> 
> I understand what you are suggesting. Don't take my disagreement with
> your proposal as willful misinterpretation. You are proposing something
> that's a quick hack.
Very true.

> Maybe a useful one, but it's still problematical because it has the
> upper layers telling the lower layers what to do (don't do your retry),
> rather than what service to provide (I prefer a fast error exit over
> every effort to recover the data).

Also true.

> And it also does it by overloading the meaning of EIO, which has real
> problems which you've not been open to listening to, I assume due to
> your narrow use case apparently blinding you to the bigger picture
> issues with that route.

Quite likely.

> However, there's a way forward which I think will solve these
> objections. First, designate that I/O that fails due to
> short-circuiting the normal recovery process returns ETIMEDOUT. The I/O
> stack currently doesn't use this at all (it was introduced for the
> network side of things). This is a general catch-all for an I/O that we
> complete before the lower layers have given it the maximum amount of
> effort to recover the data, at the user's request. Next, don't use a
> flag. Instead add a 32-bit field called bio_qos for quality of service
> hints and another 32-bit field for bio_qos_param. This allows us to
> pass down specific quality of service desires from the filesystem to
> the lower layers. The parameter will be unused in your proposal.
> BIO_QOS_FAIL_EARLY may be a good name for a value to set it to (at the
> moment, just use 1). We'll assign the other QOS values later for other
> things. It would allow us to implement the other sorts of QoS things I
> talked about as well.

That's a very interesting and workable suggestion. I will try to work on
it. (I have put a rough sketch of how I understand the proposal in the
postscript below.)

> As for B_FAILFAST, it's quite unlike what you're proposing, except in
> one incidental detail. It's a complicated state machine that the sd
> driver in Solaris implemented. It's an entire protocol. When the device
> gets errors, it goes into this failfast state machine. The state
> machine makes a determination that the errors are indicators the device
> is GONE, at least for the moment, and it will fail I/Os in various ways
> from there. Any new I/Os that are submitted will be failed (there's
> conditional behavior here: depending on a global setting it's either
> all I/O or just B_FAILFAST I/O).

Yeah, I realized that B_FAILFAST was quite different from the first
impression that I got from its name. Thank you for doing and sharing
your analysis of how it actually works. (I have tried to summarize my
reading of it in a sketch at the very end of this message.)

> ZFS appears to set this bit for its discovery code only, when a device
> not being there would significantly delay things.

I think that ZFS sets the bit for all 'first-attempt' I/O. It's the
various retries / recovery where this bit is not set.

> Anyway, when the device returns (basically an I/O gets through or maybe
> some other event happens), the driver exits this mode and returns to
> normal operation. It appears to be designed not for the use case that
> you described, but rather for a drive that's failing all over the place
> so that any pending I/Os get out of the way quickly. Your use case is
> only superficially similar to that use case, so the Solaris / Illumos
> experiences are mildly interesting, but due to the differences not a
> strong argument for doing this. This facility in Illumos is
> interesting, but would require significantly more retooling of the
> lower I/O layers in FreeBSD to implement fully.
> Plus Illumos (or maybe just Solaris) has a daemon that looks at
> failures to manage them at a higher level, which might make for a
> better user experience for FreeBSD, so that's something that needs to
> be weighed as well.

Okay.

> We've known for some time that HDD retry algorithms take a long time.
> The same is true of some SSD or NVMe algorithms, but not all. The other
> objection I have to the 'noretry' naming is that it bakes the currently
> observed HDD behavior and recovery into the API. This is undesirable as
> other storage technologies have retry mechanisms that happen quite
> quickly (and sometimes in the drive itself). The cutoff between fast
> and slow recovery is device specific, as are the methods used. For
> example, there are new proposals out in NVMe (and maybe T10/T13 land)
> to have new types of READ commands that specify the quality of service
> you expect, including providing some sort of deadline hint to clip how
> much effort is expended in trying to recover the data. It would be nice
> to design a mechanism that allows us to start using these commands when
> drives are available with them, and possibly using timeouts to allow
> for a faster abort. Most of your HDD I/O will complete within maybe
> ~150ms, with a long tail out to maybe as long as ~400ms. It might be
> desirable to set a policy that says 'don't let any I/Os remain in the
> device longer than a second' and use this mechanism to enforce that. Or
> don't let any I/Os last more than 20x the most recent median I/O time.
> A single bit is insufficiently expressive to allow these sorts of
> things, which is another reason for my objection to your proposal. With
> the QOS fields being independent, the clone routines just copy them and
> make no judgment on them.

I now agree with this. Thank you for the detailed explanation.

> So, those are my problems with your proposal, and also some hopefully
> useful ways to move forward. I've chatted with others for years about
> introducing QoS things into the I/O stack, so I know most of the above
> won't be too contentious (though ETIMEDOUT I haven't socialized, so
> that may be an area of concern for people).

Thank you!

-- 
Andriy Gapon
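P.S. To check my understanding of the bio_qos suggestion, here is a rough
sketch in C, not a patch: the bio_qos / bio_qos_param names and the
BIO_QOS_FAIL_EARLY value (1) are taken from your mail, while the exact
placement in struct bio, the BIO_QOS_NORMAL name and the copy helper are
only my guesses.

    #include <stdint.h>

    #define BIO_QOS_NORMAL     0u  /* default: full recovery effort */
    #define BIO_QOS_FAIL_EARLY 1u  /* prefer a fast error exit to retries */

    struct bio {
        /* ... existing struct bio fields elided ... */
        uint32_t bio_qos;        /* QoS hint from the upper layers */
        uint32_t bio_qos_param;  /* per-hint parameter; unused for
                                    BIO_QOS_FAIL_EARLY */
    };

    /*
     * The clone routines would copy the fields verbatim and make no
     * judgment on them.
     */
    static void
    bio_copy_qos(struct bio *to, const struct bio *from)
    {
        to->bio_qos = from->bio_qos;
        to->bio_qos_param = from->bio_qos_param;
    }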

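Continuing the sketch above, this is how I imagine a recovery path in a
lower layer could consult the hint. The function and its elapsed_ms
argument are invented for illustration, and BIO_QOS_DEADLINE is only a
hypothetical example of a later value that would actually use
bio_qos_param (e.g. for the 'no I/O longer than a second' policy you
mention).

    #include <errno.h>

    #define BIO_QOS_DEADLINE 2u  /* hypothetical: param = deadline in ms */

    /*
     * Decide whether to keep retrying a failed I/O or to complete it
     * early. Returns 0 to continue normal recovery, or an errno value to
     * fail the I/O with; ETIMEDOUT marks an I/O completed before the
     * maximum recovery effort was spent, distinguishing it from a
     * genuine medium error (EIO).
     */
    static int
    bio_recovery_policy(const struct bio *bp, uint32_t elapsed_ms)
    {
        if (bp->bio_qos == BIO_QOS_FAIL_EARLY)
            return (ETIMEDOUT);  /* fast error exit, no retries */
        if (bp->bio_qos == BIO_QOS_DEADLINE &&
            elapsed_ms >= bp->bio_qos_param)
            return (ETIMEDOUT);  /* deadline exceeded, stop recovery */
        return (0);
    }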
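P.P.S. And this is my reading of the B_FAILFAST state machine that you
describe, heavily simplified; all names and the error threshold are
invented, this is not the actual Solaris sd code.

    #include <stdbool.h>

    enum failfast_state { FF_INACTIVE, FF_ACTIVE };

    struct sd_softc {
        enum failfast_state ff_state;
        int  consecutive_errors;
        bool fail_all_io;  /* global setting: while active, fail all I/O
                              or only B_FAILFAST I/O */
    };

    /* Called on each I/O completion to update the state machine. */
    static void
    failfast_io_done(struct sd_softc *sc, bool ok)
    {
        if (ok) {
            sc->ff_state = FF_INACTIVE;  /* the device is back */
            sc->consecutive_errors = 0;
        } else if (++sc->consecutive_errors >= 3)  /* threshold invented */
            sc->ff_state = FF_ACTIVE;    /* treat the device as gone */
    }

    /* Called for each newly submitted I/O: fail it without queueing? */
    static bool
    failfast_reject(const struct sd_softc *sc, bool b_failfast)
    {
        if (sc->ff_state != FF_ACTIVE)
            return (false);
        return (sc->fail_all_io || b_failfast);
    }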