Hi Andreas,

When the watchdog fires (prints the stack of a thread), does this mean
that the thread has been hogging the CPU for 100 seconds? Or that the
thread has been *sleeping* for 100 seconds? Or something else?

The stack Daniel posted shows io_schedule called from __wait_on_buffer,
which would be typical of a thread waiting for disk I/O completion.
Call Trace:<ffffffffa0024125>{:sd_mod:sd_iostats_bump+147}
<ffffffffa031429a>{:ib_srp:srp_host_qcommand+399}
<ffffffff80253ebf>{deadline_next_request+34}
<ffffffff8024b329>{elv_next_request+238}
<ffffffff80309843>{io_schedule+38}
<ffffffff8017843c>{__wait_on_buffer+125}
<ffffffff801782c2>{bh_wake_function+0}
<ffffffff801782c2>{bh_wake_function+0}
<ffffffffa05771d9>{:ldiskfs:ldiskfs_mb_init_cache+469}
<ffffffff80157ba2>{add_to_page_cache+167}
<ffffffffa0577792>{:ldiskfs:ldiskfs_mb_load_buddy+257}
<ffffffffa057a89f>{:ldiskfs:ldiskfs_mb_new_blocks+1946}
<ffffffffa05b480e>{:fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+729}
<ffffffffa0574362>{:ldiskfs:ldiskfs_ext_find_extent+205}
<ffffffffa0575a69>{:ldiskfs:ldiskfs_ext_walk_space+535}
<ffffffffa05b4535>{:fsfilt_ldiskfs:ldiskfs_ext_new_extent_cb+0}
<ffffffffa05b4b56>{:fsfilt_ldiskfs:fsfilt_map_nblocks+236}
And yes, ldiskfs_ext_new_extent_cb is higher up the stack, which means
that in the bigger picture this thread is searching for free blocks. But
did the watchdog fire because the thread has been searching for free
blocks for more than 100 seconds, or because *this specific I/O request*
has been pending for more than 100 seconds?
Joe.
-----Original Message-----
From: Andreas Dilger
Sent: 17 August 2007 19:38
To: Daniel Leaberry
Cc: [email protected]
Subject: Re: [Lustre-discuss] Sudden ost crashing appears to take > 100s to find free extents
On Aug 17, 2007 07:28 -0600, Daniel Leaberry wrote:
> I have an interesting problem. I've made no changes to the IB DDN
> storage yet I'm finding OST's crashing left and right. The thread
> watchdog gets triggered,
Note that a watchdog thread stack dump is NOT a crash, but rather a
debugging mechanism so we can see where the thread is stuck for such
a long time. It should be able to continue working even after this
happens.
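
For illustration only, here is a minimal sketch of a generic "touch"
style watchdog. It is not the Lustre code: the class and method names
are hypothetical, and it assumes the watchdog measures wall-clock time
since the thread last armed it. Under that assumption the check cannot
tell a thread that is busy from one that is asleep waiting on I/O; it
only knows how long the current request has been outstanding.

```python
import time


class Watchdog:
    """Generic per-thread watchdog (hypothetical names, not Lustre's API)."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout   # seconds allowed per request
        self.clock = clock       # injectable clock, handy for testing
        self.armed_at = None     # None means the thread is idle (disarmed)

    def touch(self):
        # Called by the service thread when it starts handling a request.
        self.armed_at = self.clock()

    def disable(self):
        # Called when the request completes and the thread goes idle.
        self.armed_at = None

    def expired(self):
        # True if a request has been outstanding longer than the timeout.
        # Only elapsed wall-clock time is considered, not CPU usage.
        if self.armed_at is None:
            return False
        return self.clock() - self.armed_at > self.timeout
```

A monitor would poll expired() and, when it returns True, dump the
stack of the thread and let it carry on.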
> Is there anyway to tune the extent searching code? Does my analysis
> seem likely? Is this fixed in 1.6.1 such that I should upgrade
> immediately?
You could increase the watchdog thread timeout (this is currently a
compile time constant), but that won't remove the fact that it is taking
100s to find free space.
Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.