G'Day,

On Thu, Sep 13, 2007 at 10:39:44AM -0700, Ben Rockwood wrote:
> Matty wrote:
> >On 9/13/07, Ben Rockwood <[EMAIL PROTECTED]> wrote:
> >
> >>I've been using DTrace more and more to investigate storage performance
> >>issues and keep bumping into something I can't properly explain.  The
> >>following snippet comes from Brendan's 'iotop':
> >>
> >>...
> >>  102 13171 11536 pickup           sd6   31  384 R    2364416
> >>   80 14597 10632 httpd            sd5   31  320 R    2763776
> >>  100  5089  4977 mysqld           sd5   31  320 W    4349952
> >>    0     0     0 sched            sd5   31  320 W    5414912
> >>    0     0     0 sched            sd3   31  192 R    8247296
> >>    0     0     0 sched            sd4   31  256 R   11098112
> >>
> >>This is on a snv_67 system with several active zones.  I've been
> >>ignoring the 'sched' IO, but I'd like to understand it.  Can someone
> >>clue me in on why I see so much IO that appears to actually be
> >>touching disk?
> >
> >If I recall correctly, the writes are associated with dirty pages
> >being flushed from the page cache, and the reads are typically
> >associated with file system read ahead activity.  I am basing this on
> >something Brendan posted a long long time ago, but I can't seem to
> >find his original message to verify this.  Brendan -- if I'm wrong,
> >please correct me.
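(For reference, a rough equivalent of the iotop summary above - a minimal
sketch using only the stable io provider, not the real script - is:

  # dtrace -n 'io:::start { @[execname, args[1]->dev_statname, args[0]->b_flags & B_READ ? "R" : "W"] = sum(args[0]->b_bcount); }'

which sums bytes by process name, device and direction, and shows the same
per-disk "sched" entries.)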
Yep, that was a long time ago when I was looking at UFS :)  That was
correct - two gotchas (at least) for the io provider were:

  1. writes were frequently associated with sched, due to flushing dirty
     pages;
  2. reads were often larger, or at different offsets, than the process
     asked for, since we are tracing after the read-ahead code.

These should both still be true for ZFS, although caused by vastly
different implementations.

What seems to have changed is that reads are now much more frequently
identified as sched.  This looks like it is mostly due to ZFS prefetch
queuing transactions on the ZIO pipeline, which are eventually sent to
disk long after the user thread has left.  (It can also be due to a
zpool scrub, if you are running one.)

Here I read a file using "dd", and show who is causing device reads:

# dtrace -n 'io:::start /args[0]->b_flags & B_READ/ { @[execname, stack(16)] = count(); }'
dtrace: description 'io:::start ' matched 7 probes
^C
  dd
              genunix`ldi_strategy+0x54
              zfs`vdev_disk_io_start+0x214
              zfs`vdev_io_start+0x1d
              zfs`zio_vdev_io_start+0x205
              zfs`zio_next_stage_async+0x173
              zfs`zio_nowait+0x11
              zfs`vdev_mirror_io_start+0x19a
              zfs`zio_vdev_io_start+0x213
              zfs`zio_next_stage+0x16b
              zfs`zio_ready+0x10e
              zfs`zio_next_stage+0x16b
              zfs`zio_wait_for_children+0x56
              zfs`zio_wait_children_ready+0x20
              zfs`zio_next_stage_async+0x173
              zfs`zio_nowait+0x11
              zfs`arc_read+0x5e9
                4

  sched
              genunix`ldi_strategy+0x54
              zfs`vdev_disk_io_start+0x214
              zfs`vdev_io_start+0x1d
              zfs`zio_vdev_io_start+0x205
              zfs`zio_next_stage+0x16b
              zfs`vdev_queue_io_done+0xa5
              zfs`vdev_disk_io_done+0x16
              zfs`vdev_io_done+0x1d
              zfs`zio_vdev_io_done+0x1b
              genunix`taskq_thread+0x1dc
              unix`thread_start+0x8
              508

The "dd" context I/O was from arc_read(), as expected (a longer stack
traces it back to the syscall).  The "sched" context I/O was from a
kernel thread, which issues more I/O from vdev_queue_io_done(); since it
is issuing I/O in the context of completed I/O, the execname is unlikely
to be the original caller.

It should be possible to use DTrace to cache the pid and execname of ZFS
prefetch requests, so that the original caller can be identified in tools
like iotop.  I had a quick look and didn't find anything easy; instead it
looked like it would need several fbt probes - which would make the io*
tools unstable...  Can anyone on this list think of an easy place to
cache zio_t -> execname, such that a dd read from a file would entirely
match "dd"?  (A rough sketch of what such an fbt-based approach might
look like is in the PS below.)

This may strengthen the desire for a stable DTrace zio provider to be
created, for better observability of this I/O.

> Ya, I got an off-list response suggesting that this activity is
> particularly ZFS flushing TXGs to disk.  Which makes sense but is
> unintuitive... it's not "sched" per se that's doing IO but the kernel,
> but the kernel gets represented as PID 0, and if you deref the PID name
> you get "sched".

Yes, it is confusing, and there should be an RFE to create a /proc
representation of the txg_sync_thread, just as there are pageout and
fsflush threads.

> I imagine that the reason I see one "sched" reported per disk is because
> each ZPool member disk is being flushed by a separate kernel thread.

That's only because the aggregation is keyed on the disk as well as
execname, so sched+sd3 is a different key to sched+sd4...

cheers,

Brendan

-- 
Brendan
[CA, USA]
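PS: as a very rough illustration of that caching idea, here is an untested
sketch using unstable fbt probes on two functions visible in the stacks
above (zio_nowait() and zio_vdev_io_start()).  It only attributes zios
that pass through zio_nowait() in the original thread's context, so it may
well miss the prefetch case entirely - treat it as a starting point, not a
solution:

#!/usr/sbin/dtrace -s
/*
 * Sketch: remember which process called zio_nowait() for a given zio_t,
 * then report that name when the same zio reaches the vdev I/O stage.
 * Unstable: relies on ZFS implementation details (fbt), not the io provider.
 */

fbt::zio_nowait:entry
/execname != "sched"/
{
	/* arg0 is the zio_t pointer; cache the caller's name against it */
	zio_name[arg0] = execname;
	zio_tracked[arg0] = 1;
}

fbt::zio_vdev_io_start:entry
/zio_tracked[arg0]/
{
	@issued[zio_name[arg0]] = count();

	/* clear the entries so dynamic variable space doesn't keep growing */
	zio_tracked[arg0] = 0;
	zio_name[arg0] = 0;
}

Run it while a dd read is in progress; the aggregation prints on Ctrl-C.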