Gavin,

Thanks for the reply:

On 5/4/07, Gavin Maltby <[EMAIL PROTECTED]> wrote:
Hi Peter,

On 05/02/07 13:37, Peter Tribble wrote:
> There's this interesting comment about line 3235 of hat_sfmmu.c
>
>             /*
>              * Hblk_hmecnt and hblk_vcnt could be non zero
>              * since hblk_unload() does not gurantee that.
>              *
>              * XXX - this could cause tteload() to spin
>              * where sfmmu_shadow_hcleanup() is called.
>              */
>
> (See
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/sfmmu/vm/hat_sfmmu.c#3235)
>
> I have a system (V440, running S10U3) that has an unkillable process
> consuming a whole cpu that appears to be doing just this. At least,
> dtrace tells me the stack traces are all around:
>
>              unix`sfmmu_free_hblks+0x34
>              unix`sfmmu_shadow_hcleanup+0x58
>              unix`sfmmu_free_hblks+0x1b0
>              unix`sfmmu_shadow_hcleanup+0x58
>              unix`sfmmu_tteload_find_hmeblk+0x1f0
>              unix`sfmmu_tteload_array+0x40
>              unix`hat_memload_array+0x184
>              genunix`segvn_fault_vnodepages+0x116c
>              genunix`segvn_fault+0x3f0
>              genunix`as_fault+0x4c8
>              unix`pagefault+0xac
>              unix`trap+0xd44
>              unix`utl0+0x4c

Were you able to determine whether we're looping within that stack,
and from which level?  Or are we taking repeated pagefaults on that
thread?

Well, I used:

# dtrace -n profile-1234hz'/pid == 27742/{@[stack()] = count()}'

It appears that the process is stuck going round this loop, possibly
for hours. In the five or ten minutes I sampled, I never saw it
leave the lower sfmmu_shadow_hcleanup.

This is also related (possibly) to other hangs we've seen on this
system, with similar unkillable processes and other processes
stuck blocked in sfmmu code.

> OK. How do I confirm the nature of the problem? Is there anything
> I can do about it? How can I get this fixed?
>
> (Yes, I am going to log a support call - in fact I already have one open
> for this system, and this problem might be related. But I am interested
> in learning more and digging a little deeper.)

This part of the sfmmu code is exceptionally complex (sfmmu is complex at
its best!) and steeped in history.  I don't recommend digging too deep unless
you're into pain!  For those who are ...

Oh dear. Thanks for the information anyway, I'll try and digest it more
closely...

[elided]

So there is a small chance that back in sfmmu_shadow_hcleanup we could
still see hblk_vcnt and hblk_hmecnt non-zero, which is the case described
by the comment you highlighted.  In that case we do not remove the block from
the hash chain, and proceed on to the next chunk of addresses.
When we unwind back to the sfmmu_tteload_find_hmeblk it could
find this same block again and cause us to descend through the
whole process again.  This won't happen indefinitely since
sooner or later the pageunload will be unpinned and will
decrement the counters allowing us to free the block.

That's how it should work.  If you are spinning in that
stack at and below sfmmu_tteload_find_hmeblk then chances
are that some mismanagement of the hmecnt/vcnt has happened.
This may be verified through some post-mortem crash
analysis.
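
As a purely illustrative toy model of the retry behaviour described above
(not the actual sfmmu code: every name here, such as toy_hblk and
toy_find_and_free, is made up for the sketch, and the assumption that one
deferred unload drops both counters is a simplification), the difference
between a transient retry and a permanent spin might look like this in C:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an hme block; only the two counters under discussion. */
struct toy_hblk {
    int  hmecnt;           /* models hblk_hmecnt */
    int  vcnt;             /* models hblk_vcnt */
    bool on_chain;         /* still linked into the (imaginary) hash chain */
    int  pending_unloads;  /* deferred pageunloads that have not run yet */
};

/* Models the cleanup pass: unlink the block only if both counters are zero. */
static bool toy_cleanup(struct toy_hblk *b)
{
    assert(b->hmecnt >= 0 && b->vcnt >= 0);
    if (b->hmecnt == 0 && b->vcnt == 0) {
        b->on_chain = false;
        return true;
    }
    return false;  /* block stays on the chain and can be found again */
}

/* Models a previously pinned pageunload finally running (assumed to
 * drop both counters by one; the real accounting is more involved). */
static void toy_pageunload_runs(struct toy_hblk *b)
{
    if (b->pending_unloads > 0) {
        b->pending_unloads--;
        if (b->hmecnt > 0)
            b->hmecnt--;
        if (b->vcnt > 0)
            b->vcnt--;
    }
}

/* Models the lookup retrying until the block can be freed.
 * Returns the number of passes needed, or -1 if max_passes is exhausted
 * (the bounded stand-in for "spinning forever"). */
static int toy_find_and_free(struct toy_hblk *b, int max_passes)
{
    for (int pass = 1; pass <= max_passes; pass++) {
        if (toy_cleanup(b))
            return pass;
        toy_pageunload_runs(b);  /* another thread may make progress */
    }
    return -1;
}
```

In this model, a block with hmecnt = vcnt = 1 and one pending unload is
freed on the second pass.  A block whose count has been leaked (say
hmecnt = 1 with nothing pending to decrement it) is never freed, so the
loop only ends because of the artificial max_passes bound.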

I have some crash dumps - what should I be looking for?

Thanks,

--
-Peter Tribble
http://www.petertribble.co.uk/ - http://ptribble.blogspot.com/
_______________________________________________
opensolaris-code mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/opensolaris-code
