G'Day,

On Thu, Sep 13, 2007 at 10:39:44AM -0700, Ben Rockwood wrote:
> Matty wrote:
> >On 9/13/07, Ben Rockwood <[EMAIL PROTECTED]> wrote:
> >
> >>I've been using DTrace more and more to investigate storage performance
> >>issues and keep bumping into something I can't properly explain.  The
> >>following snippet comes from Brendan's 'iotop':
> >>
> >>...
> >>  102 13171 11536 pickup           sd6   31  384 R    2364416
> >>   80 14597 10632 httpd            sd5   31  320 R    2763776
> >>  100  5089  4977 mysqld           sd5   31  320 W    4349952
> >>    0     0     0 sched            sd5   31  320 W    5414912
> >>    0     0     0 sched            sd3   31  192 R    8247296
> >>    0     0     0 sched            sd4   31  256 R   11098112
> >>
> >>This is on a snv_67 system with several active zones.  I've been
> >>ignoring the 'sched' IO, but I'd like to understand it.  Can someone
> >>clue me in on why I see so much IO that appears to actually be
> >>touching disk?
> >
> >If I recall correctly, the writes are associated with dirty pages
> >being flushed from the page cache, and the reads are typically
> >associated with file system read ahead activity.  I am basing this on
> >something Brendan posted a long long time ago, but I can't seem to
> >find his original message to verify this.  Brendan -- if I'm wrong,
> >please correct me.
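(For reference, a rough equivalent of the iotop summary above - a minimal
sketch using only the stable io provider, not the real script - is:

  # dtrace -n 'io:::start { @[execname, args[1]->dev_statname, args[0]->b_flags & B_READ ? "R" : "W"] = sum(args[0]->b_bcount); }'

which sums bytes by process name, device and direction, and shows the same
per-disk "sched" entries.)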
Yep, that was a long time ago when I was looking at UFS :)  That was
correct - two gotchas (at least) for the io provider were:

  1. writes were frequently associated with sched, due to flushing dirty
     pages;
  2. reads were often larger, or at different offsets, than the process
     asked for, since we are tracing after the read-ahead code.

These should both still be true for ZFS, although caused by vastly
different implementations.

What seems to have changed is that reads are now much more frequently
identified as sched.  This looks like it is mostly due to ZFS prefetch
queuing transactions on the ZIO pipeline, which are eventually sent to
disk long after the user thread has left.  (It can also be due to a
zpool scrub, if you are running one.)

Here I read a file using "dd", and show who is causing device reads:

# dtrace -n 'io:::start /args[0]->b_flags & B_READ/ { @[execname, stack(16)] = count(); }'
dtrace: description 'io:::start ' matched 7 probes
^C
  dd
              genunix`ldi_strategy+0x54
              zfs`vdev_disk_io_start+0x214
              zfs`vdev_io_start+0x1d
              zfs`zio_vdev_io_start+0x205
              zfs`zio_next_stage_async+0x173
              zfs`zio_nowait+0x11
              zfs`vdev_mirror_io_start+0x19a
              zfs`zio_vdev_io_start+0x213
              zfs`zio_next_stage+0x16b
              zfs`zio_ready+0x10e
              zfs`zio_next_stage+0x16b
              zfs`zio_wait_for_children+0x56
              zfs`zio_wait_children_ready+0x20
              zfs`zio_next_stage_async+0x173
              zfs`zio_nowait+0x11
              zfs`arc_read+0x5e9
                4

  sched
              genunix`ldi_strategy+0x54
              zfs`vdev_disk_io_start+0x214
              zfs`vdev_io_start+0x1d
              zfs`zio_vdev_io_start+0x205
              zfs`zio_next_stage+0x16b
              zfs`vdev_queue_io_done+0xa5
              zfs`vdev_disk_io_done+0x16
              zfs`vdev_io_done+0x1d
              zfs`zio_vdev_io_done+0x1b
              genunix`taskq_thread+0x1dc
              unix`thread_start+0x8
              508

The "dd" context I/O was from arc_read(), as expected (a longer stack
traces it back to the syscall).  The "sched" context I/O was from a
kernel thread, which issues more I/O from vdev_queue_io_done(); since it
is issuing I/O in the context of completed I/O, the execname is unlikely
to be the original caller.

It should be possible to use DTrace to cache the pid and execname of ZFS
prefetch requests, so that the original caller can be identified in tools
like iotop.  I had a quick look and didn't find anything easy; instead it
looked like it would need several fbt probes - which would make the io*
tools unstable...  Can anyone on this list think of an easy place to
cache zio_t -> execname, such that a dd read from a file would entirely
match "dd"?  (A rough sketch of what such an fbt-based approach might
look like is in the PS below.)

This may strengthen the desire for a stable DTrace zio provider to be
created, for better observability of this I/O.

> Ya, I got an off-list response suggesting that this activity is
> particularly ZFS flushing TXGs to disk.  Which makes sense but is
> unintuitive... it's not "sched" per se that's doing IO but the kernel,
> but the kernel gets represented as PID 0, and if you deref the PID name
> you get "sched".

Yes, it is confusing, and there should be an RFE to create a /proc
representation of the txg_sync_thread, just as there are pageout and
fsflush threads.

> I imagine that the reason I see one "sched" reported per disk is because
> each ZPool member disk is being flushed by a separate kernel thread.

That's only because the aggregation is keyed on the disk as well as
execname, so sched+sd3 is a different key to sched+sd4...

cheers,

Brendan

-- 
Brendan
[CA, USA]
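PS: as a very rough illustration of that caching idea, here is an untested
sketch using unstable fbt probes on two functions visible in the stacks
above (zio_nowait() and zio_vdev_io_start()).  It only attributes zios
that pass through zio_nowait() in the original thread's context, so it may
well miss the prefetch case entirely - treat it as a starting point, not a
solution:

#!/usr/sbin/dtrace -s
/*
 * Sketch: remember which process called zio_nowait() for a given zio_t,
 * then report that name when the same zio reaches the vdev I/O stage.
 * Unstable: relies on ZFS implementation details (fbt), not the io provider.
 */

fbt::zio_nowait:entry
/execname != "sched"/
{
	/* arg0 is the zio_t pointer; cache the caller's name against it */
	zio_name[arg0] = execname;
	zio_tracked[arg0] = 1;
}

fbt::zio_vdev_io_start:entry
/zio_tracked[arg0]/
{
	@issued[zio_name[arg0]] = count();

	/* clear the entries so dynamic variable space doesn't keep growing */
	zio_tracked[arg0] = 0;
	zio_name[arg0] = 0;
}

Run it while a dd read is in progress; the aggregation prints on Ctrl-C.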