> Today's rebuild has given me uptimes of below an hour, usually.  The box will 
> stay up in single user mode long enough to rebuild world/kernel, but 
> multi-user it is panicking at 
> /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:1592
> The backtrace shows that it gets to this panic from a sendfile() syscall.  
> The line above is in the middle of a big edit that's part of svn revision 
> 329363.  The tripping assertion seems to suggest that m->valid != 0, for 
> whatever that's worth.

I am doing a bit of an offline investigation with Andrew and it seems that the
actual panic message is this:

panic: vm_page_assert_xbusied: page 0xfffff807ebbd8f98 not exclusive busy @

The stack is this:
vpanic() at vpanic/frame 0xfffffe00b3c36390
dmu_read_pages() at dmu_read_pages+0x535/frame 0xfffffe00b3c36460
zfs_freebsd_getpages() at zfs_freebsd_getpages+0x24c/frame 0xfffffe00b3c36510
VOP_GETPAGES_APV() at VOP_GETPAGES_APV+0xd9/frame 0xfffffe00b3c36540
vop_stdgetpages_async() at vop_stdgetpages_async+0x49/frame 0xfffffe00b3c36590
VOP_GETPAGES_ASYNC_APV() at VOP_GETPAGES_ASYNC_APV+0xd9/frame 0xfffffe00b3c365c0
vnode_pager_getpages_async() at vnode_pager_getpages_async+0x81/frame
vn_sendfile() at vn_sendfile+0xe70/frame 0xfffffe00b3c368e0
sendfile() at sendfile+0x149/frame 0xfffffe00b3c36980
amd64_syscall() at amd64_syscall+0x79b/frame 0xfffffe00b3c36ab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0x7fffffffdb00

I looked at the sendfile_swapin() code and it seems that it uses the pager API
in an undocumented way.  Specifically, it inserts bogus_page into the array of
requested pages.  For starters, bogus_page is not busied, while VOP_GETPAGES is
documented to require that all requested pages be exclusively busied.  Second,
I always had the impression that bogus_page is an implementation detail of the
unified buffer / page cache and that other code need not be aware of it.

So, my opinion is that the sendfile code uses a "clever hack" that happens to
work with the buffer cache based filesystems, but that the hack is nonetheless
a bug.  I'd therefore prefer that the problem be fixed in the sendfile code.
But I am open to being convinced that all VOP_GETPAGES implementations,
including the one in ZFS, must be made aware of bogus_page.  Or, at least, that
they should not verify that the requested pages are busied.
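
To make the two options concrete, here is a toy userland model of the check.
It is not kernel code: toy_page, toy_bogus_page and both functions are names
made up for this illustration.  It only assumes what is described above, namely
that the placeholder page is never busied and may appear in the requested
array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a VM page; only the state relevant to the check. */
struct toy_page {
	bool xbusied;		/* models the "exclusive busy" state */
};

/* Models the shared placeholder page, which nobody ever busies. */
static struct toy_page toy_bogus_page = { .xbusied = false };

/*
 * Strict contract: every requested page must be exclusively busied.
 * Returns the index of the first offending page, or -1 if all pass.
 */
static int
toy_check_strict(struct toy_page **ma, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (!ma[i]->xbusied)
			return ((int)i);
	}
	return (-1);
}

/*
 * Relaxed contract: the callee knows about the placeholder and skips it,
 * but still insists that every real page is exclusively busied.
 */
static int
toy_check_skip_bogus(struct toy_page **ma, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (ma[i] == &toy_bogus_page)
			continue;
		if (!ma[i]->xbusied)
			return ((int)i);
	}
	return (-1);
}
```

With an array holding a busied page, the placeholder, and another busied page,
the strict check reports the placeholder's slot (the analogue of the assertion
tripping here), while the relaxed check accepts the array.  The model says
nothing about which side should change; it only shows that either the caller
stops passing the placeholder or every callee has to special-case it.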

Andriy Gapon
