Hi,

One possibility would be to make it an MD build-time option, defaulting
to 1 MB on regular systems and 128 kB on smaller systems.
Of course, I guess making it a tunable (or sysctl) would be best, though.

On Sat, 3 Jun 2017 23:49:01 -0600 Warner Losh <i...@bsdimp.com> wrote:

> On Sat, Jun 3, 2017 at 11:28 PM, Warner Losh <i...@bsdimp.com> wrote:
> >
> > On Sat, Jun 3, 2017 at 9:55 PM, Allan Jude <allanj...@freebsd.org> wrote:
> >
> >> On 2017-06-03 22:35, Julian Elischer wrote:
> >> > On 4/6/17 4:59 am, Colin Percival wrote:
> >> >> On January 24, 1998, in what was later renumbered to SVN r32724,
> >> >> dyson@ wrote:
> >> >>> Add better support for larger I/O clusters, including larger
> >> >>> physical I/O.  The support is not mature yet, and some of the
> >> >>> underlying implementation needs help.  However, support does
> >> >>> exist for IDE devices now.
> >> >> and increased MAXPHYS from 64 kB to 128 kB.  Is it time to increase
> >> >> it again, or do we need to wait at least two decades between changes?
> >> >>
> >> >> This is hurting performance on some systems; in particular, EC2 "io1"
> >> >> disks are optimized for 256 kB I/Os, EC2 "st1" (throughput-optimized
> >> >> spinning rust) disks are optimized for 1 MB I/Os, and Amazon's NFS
> >> >> service (EFS) recommends using a maximum I/O size of 1 MB (and
> >> >> despite NFS not being *physical* I/O, it seems to still be limited
> >> >> by MAXPHYS).
> >> >>
> >> > We increased it in FreeBSD 8 and 10.3 on our systems, with only good
> >> > results:
> >> >
> >> >     sys/sys/param.h:#define MAXPHYS  (1024 * 1024)  /* max raw I/O
> >> >     transfer size */
> >> >
> >> > _______________________________________________
> >> > email@example.com mailing list
> >> > https://lists.freebsd.org/mailman/listinfo/freebsd-current
> >> > To unsubscribe, send any mail to
> >> > "freebsd-current-unsubscribe@freebsd.org"
> >>
> >> At some point Warner and I discussed how hard it might be to make this
> >> a boot-time tunable, so that big amd64 machines can have a larger value
> >> without causing problems for smaller machines.
> >>
> >> ZFS supports a block size of 1 MB, and doing I/Os in 128 kB negates
> >> some of the benefit.
> >>
> >> I am preparing some benchmarks and other data, along with a patch to
> >> increase the maximum size of pipe I/Os as well, because using 1 MB
> >> offers a relatively large performance gain there too.
> >
> > It doesn't look to be hard to change this, though struct buf depends
> > on MAXPHYS:
> >
> >     struct vm_page *b_pages[btoc(MAXPHYS)];
> >
> > and b_pages isn't the last item in the struct, so changing MAXPHYS at
> > boot time would cause an ABI change.  IMHO, we should move it to the
> > last element so that wouldn't happen.  IIRC all buf allocations are
> > from a fixed pool.  We'd have to audit anybody that creates one on the
> > stack knowing it will be persisted.  Given how things work, I don't
> > think this is possible, so we may be safe.  Thankfully, struct bio
> > doesn't seem to be affected.
> >
> > As for making it boot-time configurable, it shouldn't be too horrible
> > with the above change.  We should have enough of the tunables
> > mechanism up early enough to pull this in before we create the buf
> > pool.
> >
> > Netflix runs a MAXPHYS of 8 MB.  There are issues with something this
> > big, to be sure, especially on memory-limited systems.  Lots of
> > hardware can't do an I/O this big, and some drivers can't cope even if
> > the underlying hardware can.  Since we don't use such drivers at work,
> > I don't have a list handy (though I think the SG list for NVMe limits
> > it to 1 MB).  128k is a totally reasonable bump by default, but I
> > think going larger by default should be approached with some caution
> > given the overhead that adds to struct buf.  Having it be a run-time
> > tunable would be great.
>
> Of course 128k is reasonable; it's the current default :).  I meant to
> say that doubling it would have a limited impact.  1 MB might be a good
> default, but it might be too big for smaller systems (nothing says it
> has to be an MI constant, though).
> It would be a perfectly fine default if it were a tunable.
>
> > There are a number of places in userland that depend on MAXPHYS, which
> > is unfortunate, since they assume a fixed value and don't pick it up
> > from the kernel or the kernel config.  Thankfully, there are only a
> > limited number of these.
>
> There are a number of other places that assume MAXPHYS is constant.
> The ahci driver uses it to define the max number of SG operations you
> can have, for example.  aio has an array sized based on it.  There are
> some places that use it when they should use 128k instead.  There are
> several places that use it to define other constants, and it would take
> a while to run them all to ground to make sure they are all good.  We
> might need to bump DFLTPHYS as well, so it might also make a good
> tunable.  There are a few places that check things in terms of a fixed
> multiple of MAXPHYS; these are rules of thumb that kinda work today,
> maybe by accident, or maybe the 100 * MAXPHYS is highly scientific.
> It's hard to say without careful study.
>
> For example, until recently, nvmecontrol would use MAXPHYS.  But it's
> the system default MAXPHYS.  And even if it isn't, there's currently a
> hard limit of 1 MB for an I/O imposed by how the driver uses NVMe's SG
> lists.  But it doesn't show up as MAXPHYS, but rather as
> NVME_MAX_XFER_SIZE in places.  It totally surprised me when I hit this
> problem at runtime and tracked it to ground.
>
> > Of course, there are times when I/Os can return much more than this.
> > Reading drive log pages, for example, can generate tens or hundreds of
> > MB of data, and there's no way to do that with one transaction today.
> > If drive makers were perfect, we could use the generally defined
> > offset and length fields to read them out piecemeal.  That assumes the
> > log is stable, a big if for some of the snapshots of internal state
> > logs that are sometimes necessary to investigate problems...
> > It sure would be nice if there were a way to have super-huge I/Os on
> > an exception basis for these situations.
>
> The hardest part about doing this is chasing down all the references,
> since it winds up in the craziest of places.
>
> Warner

--
Tomoaki AOKI <junch...@dec.sakura.ne.jp>