Re: Strange system hangs

2007-12-02 Thread Krzysztof Oledzki



On Sat, 29 Sep 2007, Nick Piggin wrote:


> On Friday 28 September 2007 18:42, Krzysztof Oledzki wrote:
> > Hello,
> >
> > I am experiencing weird system hangs. Once about 2-5 weeks system freezes
> > and stops accepting remote connections, so it is no longer possible to
> > connect to most important services: smtp (postfix), www (squid) or even
> > ssh. Such connection is accepted but then it hangs.
> >
> > What is strange, that previously established ssh session is usable. It is
> > possible to work on such system until you do something stupid like "less
> > /var/log/all.log". Using strace I found that process blocks on:
>
> Is this a regression? If so, what's the most recent kernel that didn't show
> the problem?
>
> The symptoms could be consistent with some place doing a
> balance_dirty_pages while holding a lock that is required for IO, but I can't
> see a smoking gun (you've got contention on i_mutex, but that should be
> OK).
>
> Can you see if there is any memory under writeback that isn't being
> completed (sysrq+M), also a list the locks held after the hang might be
> helpful (compile in lockdep and sysrq+D)
>
> Is anything currently running? (sysrq+P and even a full sysrq+T task list
> could be useful).
>
> Are any IO errors occurring at all?

It seems that 2.6.23.x still fails, but in a somewhat different way. I updated
my bug report at: http://bugzilla.kernel.org/show_bug.cgi?id=9182. There are
new attachments with traces and an oops that happened while I was taking
the debugging data.


Thank you.

Best regards,


Krzysztof Olędzki

Re: Strange system hangs

2007-11-02 Thread Thomas Osterried
Hello,

This report tends to become a novel. In short, the most important facts:
  - after some days of uptime, a process such as rsync suddenly gets stuck
    in write congestion; other processes follow.
  - balance_dirty_pages_ratelimited_nr problem
  - large amount of dirty pages
  - processes do not terminate and cause a heavy load
  - process accounting, even though not enabled?


We have serious problems on some servers running kernel 2.6.19 and up.
The thread
  http://marc.info/?l=linux-kernel&m=119252148829463&w=2
matches our problem exactly, in particular the
balance_dirty_pages_ratelimited_nr problem in Krzysztof Oledzki's trace.

It seems to be the same problem as in
  http://marc.info/?l=linux-kernel&m=119125485615927&w=2
which may be fixed by this patch for 2.6.23-git
  http://marc.info/?l=git-commits-head&m=119263941428270&w=2
but it may differ, because
  - we do not have any nfs, loop or fuse mounts
  - that issue is regarded as a speed problem which resolves in seconds, not hours



We have observed that the problem occurs within about 10 days,
(but one kernel version showed it within 24h on the same machine).
For some machines it takes one or two months for the problem to come up.
But we also have machines (comparable setup, same linux installation
(debian sarge), same kernel config, completely different hardware) running
kernel 2.6.20 where it does not emerge.
We also have machines running kernel <= 2.6.18 which never showed the problem.
Our kernel config is always derived from that of the previous kernel.
Our kernel is vanilla and comes from ftp.kernel.org. But for a test,
we also tried kernel 2.6.20-16 from ubuntu, which also showed the problem.
Of course, we have tried the kernel without commercial modules (not tainted).

Unfortunately we could not force the error to happen (we have to wait).
And very interestingly, we completely exchanged all hardware components
(taken from a machine where the bug did not happen) and the unstable
server still remained unstable.

We observe that the dirty pages count increases (/proc/vmstat) and
/proc/meminfo shows an amount of 400 MB (!) when the problem appears.
It happens mostly during the backup process (rsync). But we also had the
failure when backup was turned off; we just had to wait longer for it
to happen. rsync seems to be a catalyst for the problem.
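
As an illustration of how the dirty page figure above can be watched, here is a minimal Python sketch that extracts the Dirty and Writeback values from /proc/meminfo-style text. The function names are hypothetical and not taken from this thread; the only assumption about the input is the usual "Name:   value kB" line layout.

```python
import re

# Matches lines such as "Dirty:            409600 kB".
MEMINFO_LINE = re.compile(r"^(\w+):\s+(\d+)\s*kB")


def parse_meminfo_kb(text, fields=("Dirty", "Writeback")):
    """Return {field: value_in_kB} for the requested meminfo fields."""
    values = {}
    for line in text.splitlines():
        match = MEMINFO_LINE.match(line)
        if match and match.group(1) in fields:
            values[match.group(1)] = int(match.group(2))
    return values


def read_dirty_state(path="/proc/meminfo"):
    """Read the live values from the running kernel (Linux only)."""
    with open(path) as handle:
        return parse_meminfo_kb(handle.read())
```

Calling read_dirty_state() periodically (for example from cron) and logging the result would show whether the dirty page count climbs before a hang, as described above.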

When the error occurs, processes do not terminate: they try to exit,
but still remain in the process list. The machine is powerful, and thus
even if the load is above 500, a program itself runs fast and responsively;
but after printing its last expected lines (e.g. "uptime"), it does not
terminate and remains in 'D' state.
When killing the "rsync" process (and/or others), the machine may recover.
Or, if we have time to wait (we usually do not), the lock resolves after
several hours. In a test we "waited" 14 hours. The number of dirty pages
decreases from 400 MB to <100 MB.
We observed that once the machine has run into trouble (e.g. after 10
days of trouble-free uptime), it then always shows this error during the
daily rsync backup.

Diagnostics:
with "echo t > /proc/sysrq-trigger" we see, that many processes hang in a
mutex_lock after ext3_file_write(). Some of these are in congestion_wait()
after balance_dirty_pages_ratelimited_nr() after ext3_file_write().
We could not enforce this long time deadlock by hand. But it's obvoiously
the same (due to the call trace) because we can trigger a short-time
with multible concurrent "dd if=/dev/zero of=foo bs=4000k" processes.
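
The concurrent dd test mentioned above could be approximated roughly as follows. This is only a sketch: it uses Python threads instead of separate dd processes, it bounds the amount written (plain dd without count= runs until the disk is full), and the names write_zeros and run_writers are invented for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def write_zeros(path, block_size, blocks):
    """Write `blocks` zero-filled blocks of `block_size` bytes to `path`."""
    chunk = bytes(block_size)  # block_size zero bytes, like reading /dev/zero
    with open(path, "wb") as handle:
        for _ in range(blocks):
            handle.write(chunk)
    return os.path.getsize(path)


def run_writers(directory, writers, block_size, blocks):
    """Start `writers` concurrent writers and return {path: bytes_written}."""
    paths = [os.path.join(directory, f"foo{i}") for i in range(writers)]
    with ThreadPoolExecutor(max_workers=writers) as pool:
        sizes = list(pool.map(lambda p: write_zeros(p, block_size, blocks), paths))
    return dict(zip(paths, sizes))
```

For something close to the dd invocation, block_size would be 4000 * 1024; the number of writers and blocks needed to produce noticeable write congestion is not stated in this report and would have to be found by experiment.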

I can only speculate whether the non-terminating processes cause or worsen
the problem, or whether they are just a consequence of process accounting
(see below), which is itself waiting to write its data to a file.

Nevertheless the sysrq-trigger method allowed us to see what causes
terminating processes to wait in their exit()-call:
do_group_exit() calls do_exit() which calls acct_process(). acct_process()
does a do_sync_write() which hangs in a mutex_lock.
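
A check for this call chain in a captured trace could look like the sketch below. It assumes the "[<address>] symbol+0xoff/0xlen" frame format seen elsewhere in this thread, with the innermost frame listed first; the helper names are hypothetical.

```python
import re

# One stack frame, e.g. " [<7815f453>] do_sync_write+0xc7/0x10a".
FRAME_RE = re.compile(r"\[<[0-9a-fA-F]+>\]\s+([A-Za-z_][\w.]*)\+0x")


def frame_symbols(trace_text):
    """Return the symbol names of a call trace, innermost frame first."""
    return FRAME_RE.findall(trace_text)


def follows_call_chain(trace_text, chain):
    """True if `chain` (outermost caller first) occurs in that order.

    Other frames may appear between the chain members; only the relative
    order matters.
    """
    frames = iter(reversed(frame_symbols(trace_text)))
    # `symbol in frames` consumes the iterator up to the match, so each
    # following symbol has to be found further down the call chain.
    return all(symbol in frames for symbol in chain)
```

With the sysrq-t output saved to a file, follows_call_chain(text, ["do_group_exit", "do_exit", "acct_process", "do_sync_write", "mutex_lock"]) would indicate whether a task's trace matches the exit path described above.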

If we boot a machine, then enable process accounting (acct(2)) and 
then do the file-I/O tests mentioned above, we have the same effect of
non-terminating processes, and the sysrq-trigger result corresponds. They
terminate after some outstanding blocks from "dd" are written.
If process accounting is off, the kernel does not call acct_process() (tested),
which is expected.

Ok, this explains the many non-terminating processes and the load.
But it raises another question. We do not have and do not need process
accounting, and we have not even installed the accton tools.
Thus, why does the buggy machine call acct_process() during the exit
of processes?

Unfortunately, the kernel has no interface (/sys would be nice) for checking
whether process accounting is really on and, if so, which file it actually
writes to.
For the next occurrence, which we tensely await, we are prepared to:
  - force process accounting off by calling acct(0) and examine the output
    of sysrq-trigger
  - install a patched kernel which gives us the opportunity to see if

Re: Strange system hangs

2007-09-29 Thread Krzysztof Oledzki



On Sat, 29 Sep 2007, Nick Piggin wrote:


> On Friday 28 September 2007 18:42, Krzysztof Oledzki wrote:
> > Hello,
> >
> > I am experiencing weird system hangs. Once about 2-5 weeks system freezes
> > and stops accepting remote connections, so it is no longer possible to
> > connect to most important services: smtp (postfix), www (squid) or even
> > ssh. Such connection is accepted but then it hangs.
> >
> > What is strange, that previously established ssh session is usable. It is
> > possible to work on such system until you do something stupid like "less
> > /var/log/all.log". Using strace I found that process blocks on:
>
> Is this a regression? If so, what's the most recent kernel that didn't show
> the problem?

I don't know. The first kernel I ran was 2.6.20.x. This is a quite fresh system.

> The symptoms could be consistent with some place doing a
> balance_dirty_pages while holding a lock that is required for IO, but I can't
> see a smoking gun (you've got contention on i_mutex, but that should be
> OK).
>
> Can you see if there is any memory under writeback that isn't being
> completed (sysrq+M), also a list the locks held after the hang might be
> helpful (compile in lockdep and sysrq+D)

OK. I'll try to do it next time if there is a chance. It may take
some time, BTW.

> Is anything currently running? (sysrq+P and even a full sysrq+T task list
> could be useful).

I'll have to check - maybe I have this captured. If not I'll check it next
time.

> Are any IO errors occurring at all?

Didn't notice - so no.

Thank you.

Best regards,


Krzysztof Olędzki

Re: Strange system hangs

2007-09-29 Thread Nick Piggin
On Friday 28 September 2007 18:42, Krzysztof Oledzki wrote:
> Hello,
>
> I am experiencing weird system hangs. Once about 2-5 weeks system freezes
> and stops accepting remote connections, so it is no longer possible to
> connect to most important services: smtp (postfix), www (squid) or even
> ssh. Such connection is accepted but then it hangs.
>
> What is strange, that previously established ssh session is usable. It is
> possible to work on such system until you do something stupid like "less
> /var/log/all.log". Using strace I found that process blocks on:

Is this a regression? If so, what's the most recent kernel that didn't show
the problem?

The symptoms could be consistent with some place doing a
balance_dirty_pages while holding a lock that is required for IO, but I can't
see a smoking gun (you've got contention on i_mutex, but that should be
OK).

Can you see if there is any memory under writeback that isn't being
completed (sysrq+M), also a list the locks held after the hang might be
helpful (compile in lockdep and sysrq+D)

Is anything currently running? (sysrq+P and even a full sysrq+T task list
could be useful).

Are any IO errors occurring at all?

Thanks,
Nick
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/
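
The SysRq dumps requested in the message above (sysrq+M, sysrq+D, sysrq+P, sysrq+T) can also be triggered without a console keyboard by writing the key letter to /proc/sysrq-trigger as root, which is handy when only an already established ssh session still works. The following Python sketch shows the idea; the action names and the trigger_sysrq helper are made up for this example, and the output of each dump lands in the kernel log, not on stdout.

```python
# Action names are hypothetical; the letters are the SysRq keys named above.
SYSRQ_KEYS = {
    "memory": "m",     # sysrq+M: memory information
    "locks": "d",      # sysrq+D: locks held (kernel built with lockdep)
    "registers": "p",  # sysrq+P: registers of the currently running task
    "tasks": "t",      # sysrq+T: full task list with call traces
}


def trigger_sysrq(action, trigger_path="/proc/sysrq-trigger"):
    """Write the SysRq key for `action` to the trigger file (needs root)."""
    key = SYSRQ_KEYS[action]
    with open(trigger_path, "w") as handle:
        handle.write(key)
    return key
```

The trigger_path parameter exists only so the helper can be tried against an ordinary file first; on a real system the default path is the one to use, and the result can be read back with dmesg.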


Re: Strange system hangs

2007-09-28 Thread Krzysztof Oledzki



On Fri, 28 Sep 2007, Peter Zijlstra wrote:


> On Fri, 2007-09-28 at 10:42 +0200, Krzysztof Oledzki wrote:
> > Hello,
> >
> > I am experiencing weird system hangs. Once about 2-5 weeks system freezes
> > and stops accepting remote connections, so it is no longer possible to
> > connect to most important services: smtp (postfix), www (squid) or even
> > ssh. Such connection is accepted but then it hangs.
> >
> > What is strange, that previously established ssh session is usable. It is
> > possible to work on such system until you do something stupid like "less
> > /var/log/all.log".
>
> So it takes weeks to reproduce this?


Unfortunately, yes. :(


> >                          free                        sibling
> >   task             PC    stack   pid father child younger older
> > syslogd   D F5C83C60 0  2162  1 (NOTLB)
> > f5c83c74 0082 0002 f5c83c60 f5c83c5c   78538d20
> > 0009 0001 f7f6a070 f7cb8030 82c47e5f 0001cfed 0a43 f7f6a17c
> > 7a016980 f705dc80 78404217 7812c708  0213 f5c83c84 1e7a64bb
> > Call Trace:
> >   [<78404217>] _spin_unlock_irqrestore+0xf/0x23
> >   [<7812c708>] __mod_timer+0x92/0x9c
> >   [<78402b34>] schedule_timeout+0x70/0x8d
> >   [<7812c521>] process_timeout+0x0/0x5
> >   [<78402548>] io_schedule_timeout+0x1e/0x28
> >   [<7814d41e>] congestion_wait+0x50/0x64
> >   [<78134abc>] autoremove_wake_function+0x0/0x35
> >   [<781493e7>] balance_dirty_pages_ratelimited_nr+0x16e/0x1dc
> >   [<78145bd0>] generic_file_buffered_write+0x4ee/0x605
> >   [<783c55a1>] unix_dgram_recvmsg+0x1b4/0x1c8
> >   [<78128c8e>] current_fs_time+0x41/0x46
> >   [<78146167>] __generic_file_aio_write_nolock+0x480/0x4df
> >   [<7814621b>] generic_file_aio_write+0x55/0xb3
> >   [<78194b28>] ext3_file_write+0x24/0x8f
> >   [<7815f34f>] do_sync_readv_writev+0xc1/0xfe
> >   [<78134abc>] autoremove_wake_function+0x0/0x35
> >   [<784041ae>] _spin_unlock+0xd/0x21
> >   [<781a8c38>] log_wait_commit+0xc3/0xe3
> >   [<7814448b>] find_get_pages_tag+0x76/0x80
> >   [<7815f204>] rw_copy_check_uvector+0x50/0xaa
> >   [<7815f9d4>] do_readv_writev+0x99/0x164
> >   [<78194b04>] ext3_file_write+0x0/0x8f
> >   [<7815fadc>] vfs_writev+0x3d/0x48
> >   [<7815feb5>] sys_writev+0x41/0x67
> >   [<78103d6a>] sysenter_past_esp+0x5f/0x85
> >   ===
>
> This trace puzzles me, what is: unix_dgram_recvmsg doing there.
> Also, it has two invocations of: ext3_file_write
> do you have a stacked filesystem of sorts, ext3 on loopback on ext3?


No, no loopback:

# mount
/dev/md0 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec)
devpts on /dev/pts type devpts (rw,nosuid,noexec)
/dev/mapper/VolGrp0-usr on /usr type ext3 (rw,nodev,data=journal)
/dev/mapper/VolGrp0-var on /var type ext3 (rw,nodev,data=journal)
/dev/mapper/VolGrp0-squid_spool on /var/cache/squid/cd0 type ext3 (rw,nosuid,nodev,noatime,data=writeback)
/dev/mapper/VolGrp0-squid_spool2 on /var/cache/squid/cd1 type ext3 (rw,nosuid,nodev,noatime,data=writeback)
/dev/mapper/VolGrp0-news_spool on /var/spool/news type ext3 (rw,nosuid,nodev,noatime)
shm on /dev/shm type tmpfs (rw,noexec,nosuid,nodev)
usbfs on /proc/bus/usb type usbfs (rw,noexec,nosuid,devmode=0664,devgid=85)
owl:/usr/gentoo-nfs on /usr/gentoo-nfs type nfs (ro,nosuid,nodev,noatime,bg,intr,tcp,addr=192.168.129.26)

Nothing more.


> > freshclam D 0282 0  2866  1 (NOTLB)
> > f36e3cc4 0082 0009 0282 7a0173c0 0002  007b
> > 0009 0001 f7cb8030 f7c72030 82c4884d 0001cfed 09ee f7cb813c
> > 7a016980 f66c0b80 78404217 7812c708  0213 f36e3cd4 1e7a64bb
> > Call Trace:
> >   [<78404217>] _spin_unlock_irqrestore+0xf/0x23
> >   [<7812c708>] __mod_timer+0x92/0x9c
> >   [<78402b34>] schedule_timeout+0x70/0x8d
> >   [<7812c521>] process_timeout+0x0/0x5
> >   [<78402548>] io_schedule_timeout+0x1e/0x28
> >   [<7814d41e>] congestion_wait+0x50/0x64
> >   [<78134abc>] autoremove_wake_function+0x0/0x35
> >   [<781493e7>] balance_dirty_pages_ratelimited_nr+0x16e/0x1dc
> >   [<78145bd0>] generic_file_buffered_write+0x4ee/0x605
> >   [<7819cdb4>] __ext3_journal_stop+0x19/0x34
> >   [<7840408f>] _spin_lock+0xd/0x5a
> >   [<78176f3d>] __mark_inode_dirty+0xdd/0x16f
> >   [<78128c8e>] current_fs_time+0x41/0x46
> >   [<78146167>] __generic_file_aio_write_nolock+0x480/0x4df
> >   [<7814621b>] generic_file_aio_write+0x55/0xb3
> >   [<78103159>] setup_sigcontext+0x105/0x189
> >   [<78194b28>] ext3_file_write+0x24/0x8f
> >   [<7815f453>] do_sync_write+0xc7/0x10a
> >   [<78134abc>] autoremove_wake_function+0x0/0x35
> >   [<781085d2>] convert_fxsr_from_user+0x15/0xd5
> >   [<7815f38c>] do_sync_write+0x0/0x10a
> >   [<7815fbb6>] vfs_write+0x8a/0x10c
> >   [<78160123>] sys_write+0x41/0x67
> >   [<78103d6a>] sysenter_past_esp+0x5f/0x85
> >   ===
>
> single write, no networking, also stuck in balance_dirty_pages().


Exactly. Strange, isn't it?

Thanks.

Best regards,

Krzysztof Olędzki

Re: Strange system hangs

2007-09-28 Thread Peter Zijlstra
On Fri, 2007-09-28 at 10:42 +0200, Krzysztof Oledzki wrote:
> Hello,
> 
> I am experiencing weird system hangs. Once about 2-5 weeks system freezes 
> and stops accepting remote connections, so it is no longer possible to 
> connect to most important services: smtp (postfix), www (squid) or even 
> ssh. Such connection is accepted but then it hangs.
> 
> What is strange, that previously established ssh session is usable. It is 
> possible to work on such system until you do something stupid like "less 
> /var/log/all.log".

So it takes weeks to reproduce this?


>                          free                        sibling
>   task             PC    stack   pid father child younger older
> syslogd   D F5C83C60 0  2162  1 (NOTLB)
> f5c83c74 0082 0002 f5c83c60 f5c83c5c   78538d20
> 0009 0001 f7f6a070 f7cb8030 82c47e5f 0001cfed 0a43 f7f6a17c
> 7a016980 f705dc80 78404217 7812c708  0213 f5c83c84 1e7a64bb
> Call Trace:
>   [<78404217>] _spin_unlock_irqrestore+0xf/0x23
>   [<7812c708>] __mod_timer+0x92/0x9c
>   [<78402b34>] schedule_timeout+0x70/0x8d
>   [<7812c521>] process_timeout+0x0/0x5
>   [<78402548>] io_schedule_timeout+0x1e/0x28
>   [<7814d41e>] congestion_wait+0x50/0x64
>   [<78134abc>] autoremove_wake_function+0x0/0x35
>   [<781493e7>] balance_dirty_pages_ratelimited_nr+0x16e/0x1dc
>   [<78145bd0>] generic_file_buffered_write+0x4ee/0x605
>   [<783c55a1>] unix_dgram_recvmsg+0x1b4/0x1c8
>   [<78128c8e>] current_fs_time+0x41/0x46
>   [<78146167>] __generic_file_aio_write_nolock+0x480/0x4df
>   [<7814621b>] generic_file_aio_write+0x55/0xb3
>   [<78194b28>] ext3_file_write+0x24/0x8f
>   [<7815f34f>] do_sync_readv_writev+0xc1/0xfe
>   [<78134abc>] autoremove_wake_function+0x0/0x35
>   [<784041ae>] _spin_unlock+0xd/0x21
>   [<781a8c38>] log_wait_commit+0xc3/0xe3
>   [<7814448b>] find_get_pages_tag+0x76/0x80
>   [<7815f204>] rw_copy_check_uvector+0x50/0xaa
>   [<7815f9d4>] do_readv_writev+0x99/0x164
>   [<78194b04>] ext3_file_write+0x0/0x8f
>   [<7815fadc>] vfs_writev+0x3d/0x48
>   [<7815feb5>] sys_writev+0x41/0x67
>   [<78103d6a>] sysenter_past_esp+0x5f/0x85
>   ===

This trace puzzles me, what is: unix_dgram_recvmsg doing there.
Also, it has two invocations of: ext3_file_write
do you have a stacked filesystem of sorts, ext3 on loopback on ext3?



> freshclam D 0282 0  2866  1 (NOTLB)
> f36e3cc4 0082 0009 0282 7a0173c0 0002  007b
> 0009 0001 f7cb8030 f7c72030 82c4884d 0001cfed 09ee f7cb813c
> 7a016980 f66c0b80 78404217 7812c708  0213 f36e3cd4 1e7a64bb
> Call Trace:
>   [<78404217>] _spin_unlock_irqrestore+0xf/0x23
>   [<7812c708>] __mod_timer+0x92/0x9c
>   [<78402b34>] schedule_timeout+0x70/0x8d
>   [<7812c521>] process_timeout+0x0/0x5
>   [<78402548>] io_schedule_timeout+0x1e/0x28
>   [<7814d41e>] congestion_wait+0x50/0x64
>   [<78134abc>] autoremove_wake_function+0x0/0x35
>   [<781493e7>] balance_dirty_pages_ratelimited_nr+0x16e/0x1dc
>   [<78145bd0>] generic_file_buffered_write+0x4ee/0x605
>   [<7819cdb4>] __ext3_journal_stop+0x19/0x34
>   [<7840408f>] _spin_lock+0xd/0x5a
>   [<78176f3d>] __mark_inode_dirty+0xdd/0x16f
>   [<78128c8e>] current_fs_time+0x41/0x46
>   [<78146167>] __generic_file_aio_write_nolock+0x480/0x4df
>   [<7814621b>] generic_file_aio_write+0x55/0xb3
>   [<78103159>] setup_sigcontext+0x105/0x189
>   [<78194b28>] ext3_file_write+0x24/0x8f
>   [<7815f453>] do_sync_write+0xc7/0x10a
>   [<78134abc>] autoremove_wake_function+0x0/0x35
>   [<781085d2>] convert_fxsr_from_user+0x15/0xd5
>   [<7815f38c>] do_sync_write+0x0/0x10a
>   [<7815fbb6>] vfs_write+0x8a/0x10c
>   [<78160123>] sys_write+0x41/0x67
>   [<78103d6a>] sysenter_past_esp+0x5f/0x85
>   ===

single write, no networking, also stuck in balance_dirty_pages().
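
Observations like the ones in this message (a symbol listed twice in one trace, several tasks stuck in the same function) can be pulled out of a saved "Show Blocked State" dump mechanically. The Python sketch below assumes the task and frame line formats quoted above; all function names are hypothetical. It only flags candidates: whether a repeated symbol is a real second invocation is not something the script can decide.

```python
import re
from collections import Counter

# Task header, e.g. "syslogd   D F5C83C60 0  2162  1 (NOTLB)":
# name, state, PC, free stack, pid, father.
TASK_RE = re.compile(r"^(\S+)\s+([A-Z])\s+[0-9A-Fa-f]+\s+\d+\s+(\d+)\s+\d+")
# Stack frame, e.g. "[<7814d41e>] congestion_wait+0x50/0x64".
FRAME_RE = re.compile(r"\[<[0-9a-fA-F]+>\]\s+([A-Za-z_][\w.]*)\+0x")


def parse_blocked_tasks(dump_text):
    """Split a blocked-state dump into tasks with their frame symbols."""
    tasks = []
    for raw in dump_text.splitlines():
        line = raw.lstrip("> \t")  # tolerate mail quote markers
        header = TASK_RE.match(line)
        if header:
            tasks.append({"name": header.group(1), "state": header.group(2),
                          "pid": int(header.group(3)), "frames": []})
            continue
        frame = FRAME_RE.search(line)
        if frame and tasks:
            tasks[-1]["frames"].append(frame.group(1))
    return tasks


def repeated_symbols(task):
    """Return {symbol: count} for symbols listed more than once."""
    counts = Counter(task["frames"])
    return {symbol: n for symbol, n in counts.items() if n > 1}


def tasks_in(tasks, symbol):
    """Return the tasks whose trace contains `symbol`."""
    return [task for task in tasks if symbol in task["frames"]]
```

Applied to the two traces in this thread, tasks_in(tasks, "balance_dirty_pages_ratelimited_nr") would list both syslogd and freshclam, and repeated_symbols() would report ext3_file_write for syslogd.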

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Strange system hangs

2007-09-28 Thread Krzysztof Oledzki

Hello,

I am experiencing weird system hangs. About once every 2-5 weeks the system
freezes and stops accepting remote connections, so it is no longer possible
to connect to the most important services: smtp (postfix), www (squid) or
even ssh. Such a connection is accepted, but then it hangs.

What is strange is that a previously established ssh session remains usable.
It is possible to work on such a system until you do something stupid like
"less /var/log/all.log". Using strace I found that the process blocks on:


--- strace: begin ---
execve("/usr/bin/tail", ["tail", "-f", "/var/log/all.log"], [/* 33 vars */]) = 0
brk(0)  = 0x8052000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x6ff0
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=20944, ...}) = 0
mmap2(NULL, 20944, PROT_READ, MAP_PRIVATE, 3, 0) = 0x6fefa000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0RY\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1175920, ...}) = 0
mmap2(NULL, 1185212, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x6fdd8000
mmap2(0x6fef4000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x11b) = 0x6fef4000
mmap2(0x6fef7000, 9660, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x6fef7000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x6fdd7000
set_thread_area({entry_number:-1 -> 6, base_addr:0x6fdd76b0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0x6fef4000, 4096, PROT_READ)   = 0
mprotect(0x6ff1c000, 4096, PROT_READ)   = 0
munmap(0x6fefa000, 20944)   = 0
brk(0)  = 0x8052000
brk(0x8073000)  = 0x8073000
open("/var/log/all.log", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0640, st_size=3171841, ...})
llseek(3, 0,  
--- strace: end ---

This file is not very big:

# ls -l /var/log/all.log
-rw-r----- 1 root root 3171841 Sep 27 04:36 /var/log/all.log

Also running "dmesg > file" hangs, creating a file with only 4096 bytes.

--- Show Blocked State: begin ---
SysRq : Show Blocked State

                         free                        sibling
  task             PC    stack   pid father child younger older
syslogd   D F5C83C60 0  2162  1 (NOTLB)
   f5c83c74 0082 0002 f5c83c60 f5c83c5c   78538d20
   0009 0001 f7f6a070 f7cb8030 82c47e5f 0001cfed 0a43 f7f6a17c
   7a016980 f705dc80 78404217 7812c708  0213 f5c83c84 1e7a64bb
Call Trace:
 [<78404217>] _spin_unlock_irqrestore+0xf/0x23
 [<7812c708>] __mod_timer+0x92/0x9c
 [<78402b34>] schedule_timeout+0x70/0x8d
 [<7812c521>] process_timeout+0x0/0x5
 [<78402548>] io_schedule_timeout+0x1e/0x28
 [<7814d41e>] congestion_wait+0x50/0x64
 [<78134abc>] autoremove_wake_function+0x0/0x35
 [<781493e7>] balance_dirty_pages_ratelimited_nr+0x16e/0x1dc
 [<78145bd0>] generic_file_buffered_write+0x4ee/0x605
 [<783c55a1>] unix_dgram_recvmsg+0x1b4/0x1c8
 [<78128c8e>] current_fs_time+0x41/0x46
 [<78146167>] __generic_file_aio_write_nolock+0x480/0x4df
 [<7814621b>] generic_file_aio_write+0x55/0xb3
 [<78194b28>] ext3_file_write+0x24/0x8f
 [<7815f34f>] do_sync_readv_writev+0xc1/0xfe
 [<78134abc>] autoremove_wake_function+0x0/0x35
 [<784041ae>] _spin_unlock+0xd/0x21
 [<781a8c38>] log_wait_commit+0xc3/0xe3
 [<7814448b>] find_get_pages_tag+0x76/0x80
 [<7815f204>] rw_copy_check_uvector+0x50/0xaa
 [<7815f9d4>] do_readv_writev+0x99/0x164
 [<78194b04>] ext3_file_write+0x0/0x8f
 [<7815fadc>] vfs_writev+0x3d/0x48
 [<7815feb5>] sys_writev+0x41/0x67
 [<78103d6a>] sysenter_past_esp+0x5f/0x85
 ===
freshclam D 0282 0  2866  1 (NOTLB)
   f36e3cc4 0082 0009 0282 7a0173c0 0002  007b
   0009 0001 f7cb8030 f7c72030 82c4884d 0001cfed 09ee f7cb813c
   7a016980 f66c0b80 78404217 7812c708  0213 f36e3cd4 1e7a64bb
Call Trace:
 [<78404217>] _spin_unlock_irqrestore+0xf/0x23
 [<7812c708>] __mod_timer+0x92/0x9c
 [<78402b34>] schedule_timeout+0x70/0x8d
 [<7812c521>] process_timeout+0x0/0x5
 [<78402548>] io_schedule_timeout+0x1e/0x28
 [<7814d41e>] congestion_wait+0x50/0x64
 [<78134abc>] autoremove_wake_function+0x0/0x35
 [<781493e7>] balance_dirty_pages_ratelimited_nr+0x16e/0x1dc
 [<78145bd0>] generic_file_buffered_write+0x4ee/0x605
 [<7819cdb4>] __ext3_journal_stop+0x19/0x34
 [<7840408f>] _spin_lock+0xd/0x5a
 [<78176f3d>] __mark_inode_dirty+0xdd/0x16f
 [<78128c8e>] current_fs_time+0x41/0x46
 [<78146167>] __generic_file_aio_write_nolock+0x480/0x4df
 [<7814621b>] generic_file_aio_write+0x55/0xb3
 [<78103159>] 

Re: Strange system hangs

2007-09-28 Thread Peter Zijlstra
On Fri, 2007-09-28 at 10:42 +0200, Krzysztof Oledzki wrote:
 Hello,
 
 I am experiencing weird system hangs. About once every 2-5 weeks the system
 freezes and stops accepting remote connections, so it is no longer possible
 to connect to the most important services: smtp (postfix), www (squid) or
 even ssh. Such a connection is accepted, but then it hangs.
 
 What is strange is that a previously established ssh session remains usable.
 It is possible to work on such a system until you do something stupid like
 "less /var/log/all.log".

So it takes weeks to reproduce this?


                          free                        sibling
   task             PC    stack   pid father child younger older
 syslogd   D F5C83C60 0  2162  1 (NOTLB)
 f5c83c74 0082 0002 f5c83c60 f5c83c5c   78538d20
 0009 0001 f7f6a070 f7cb8030 82c47e5f 0001cfed 0a43 f7f6a17c
 7a016980 f705dc80 78404217 7812c708  0213 f5c83c84 1e7a64bb
 Call Trace:
   [<78404217>] _spin_unlock_irqrestore+0xf/0x23
   [<7812c708>] __mod_timer+0x92/0x9c
   [<78402b34>] schedule_timeout+0x70/0x8d
   [<7812c521>] process_timeout+0x0/0x5
   [<78402548>] io_schedule_timeout+0x1e/0x28
   [<7814d41e>] congestion_wait+0x50/0x64
   [<78134abc>] autoremove_wake_function+0x0/0x35
   [<781493e7>] balance_dirty_pages_ratelimited_nr+0x16e/0x1dc
   [<78145bd0>] generic_file_buffered_write+0x4ee/0x605
   [<783c55a1>] unix_dgram_recvmsg+0x1b4/0x1c8
   [<78128c8e>] current_fs_time+0x41/0x46
   [<78146167>] __generic_file_aio_write_nolock+0x480/0x4df
   [<7814621b>] generic_file_aio_write+0x55/0xb3
   [<78194b28>] ext3_file_write+0x24/0x8f
   [<7815f34f>] do_sync_readv_writev+0xc1/0xfe
   [<78134abc>] autoremove_wake_function+0x0/0x35
   [<784041ae>] _spin_unlock+0xd/0x21
   [<781a8c38>] log_wait_commit+0xc3/0xe3
   [<7814448b>] find_get_pages_tag+0x76/0x80
   [<7815f204>] rw_copy_check_uvector+0x50/0xaa
   [<7815f9d4>] do_readv_writev+0x99/0x164
   [<78194b04>] ext3_file_write+0x0/0x8f
   [<7815fadc>] vfs_writev+0x3d/0x48
   [<7815feb5>] sys_writev+0x41/0x67
   [<78103d6a>] sysenter_past_esp+0x5f/0x85
   ===

This trace puzzles me: what is unix_dgram_recvmsg doing there?
Also, it has two invocations of ext3_file_write. Do you have a stacked
filesystem of sorts, e.g. ext3 on loopback on ext3?
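
A side note that may help when reading traces like the one above: if the kernel was built without frame pointers (an assumption, the thread does not say), the trace lists every value on the stack that looks like a kernel text address, so stale entries can appear. An entry at offset +0x0 is more likely a stored function pointer than a return address. The sketch below only lists repeated symbols and zero-offset entries so they can be looked at with some suspicion; the helper names are made up for illustration and this is not an existing tool.

```python
import re
from collections import Counter

# Matches lines such as " [<78194b28>] ext3_file_write+0x24/0x8f".
# The angle brackets are optional because some archives strip them.
TRACE_LINE = re.compile(
    r"\[<?([0-9a-f]+)>?\]\s+(\w+)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_trace(text):
    """Return a list of (symbol, offset) tuples, one per call-trace line."""
    entries = []
    for line in text.splitlines():
        match = TRACE_LINE.search(line)
        if match:
            entries.append((match.group(2), int(match.group(3), 16)))
    return entries


def repeated_symbols(entries):
    """Symbols that appear more than once in the trace."""
    counts = Counter(symbol for symbol, _ in entries)
    return sorted(symbol for symbol, n in counts.items() if n > 1)


def zero_offset_symbols(entries):
    """Symbols listed at offset +0x0 (possibly stored function pointers)."""
    return sorted({symbol for symbol, offset in entries if offset == 0})


# A few lines taken from the syslogd trace in this thread.
SAMPLE = """\
 [<7814d41e>] congestion_wait+0x50/0x64
 [<78134abc>] autoremove_wake_function+0x0/0x35
 [<78194b28>] ext3_file_write+0x24/0x8f
 [<78134abc>] autoremove_wake_function+0x0/0x35
 [<78194b04>] ext3_file_write+0x0/0x8f
 [<7815feb5>] sys_writev+0x41/0x67
"""

if __name__ == "__main__":
    entries = parse_trace(SAMPLE)
    print("repeated:", repeated_symbols(entries))
    print("offset 0:", zero_offset_symbols(entries))
```

In the sample, the second ext3_file_write entry is at +0x0, so it might be a file_operations pointer left on the stack rather than a second call. That is only one possible reading and is not confirmed anywhere in this thread.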



 freshclam D 0282 0  2866  1 (NOTLB)
 f36e3cc4 0082 0009 0282 7a0173c0 0002  007b
 0009 0001 f7cb8030 f7c72030 82c4884d 0001cfed 09ee f7cb813c
 7a016980 f66c0b80 78404217 7812c708  0213 f36e3cd4 1e7a64bb
 Call Trace:
   [<78404217>] _spin_unlock_irqrestore+0xf/0x23
   [<7812c708>] __mod_timer+0x92/0x9c
   [<78402b34>] schedule_timeout+0x70/0x8d
   [<7812c521>] process_timeout+0x0/0x5
   [<78402548>] io_schedule_timeout+0x1e/0x28
   [<7814d41e>] congestion_wait+0x50/0x64
   [<78134abc>] autoremove_wake_function+0x0/0x35
   [<781493e7>] balance_dirty_pages_ratelimited_nr+0x16e/0x1dc
   [<78145bd0>] generic_file_buffered_write+0x4ee/0x605
   [<7819cdb4>] __ext3_journal_stop+0x19/0x34
   [<7840408f>] _spin_lock+0xd/0x5a
   [<78176f3d>] __mark_inode_dirty+0xdd/0x16f
   [<78128c8e>] current_fs_time+0x41/0x46
   [<78146167>] __generic_file_aio_write_nolock+0x480/0x4df
   [<7814621b>] generic_file_aio_write+0x55/0xb3
   [<78103159>] setup_sigcontext+0x105/0x189
   [<78194b28>] ext3_file_write+0x24/0x8f
   [<7815f453>] do_sync_write+0xc7/0x10a
   [<78134abc>] autoremove_wake_function+0x0/0x35
   [<781085d2>] convert_fxsr_from_user+0x15/0xd5
   [<7815f38c>] do_sync_write+0x0/0x10a
   [<7815fbb6>] vfs_write+0x8a/0x10c
   [<78160123>] sys_write+0x41/0x67
   [<78103d6a>] sysenter_past_esp+0x5f/0x85
   ===

A single write, no networking, and it is also stuck in balance_dirty_pages().
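
Since both tasks sit in balance_dirty_pages(), it might be worth recording the Dirty and Writeback counters from /proc/meminfo while the box is hung, assuming reads from /proc still work in the surviving ssh session. A minimal sketch follows; the function names are made up for illustration and the SAMPLE numbers are invented, not taken from the affected machine.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: integer_value}."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def dirty_summary(text):
    """Return only the counters relevant to dirty-page throttling (in kB)."""
    info = parse_meminfo(text)
    return {key: info.get(key) for key in ("Dirty", "Writeback")}


# Invented example values, for illustration only.
SAMPLE = """\
MemTotal:      2075000 kB
MemFree:         51200 kB
Dirty:             120 kB
Writeback:           0 kB
"""

if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            text = handle.read()
    except OSError:
        # Fall back to the invented sample when /proc is unavailable.
        text = SAMPLE
    print(dirty_summary(text))
```

Sampling this every few seconds during a hang would show whether the counters move at all, which is roughly what the memory dump from sysrq+M reports.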

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Strange system hangs

2007-09-28 Thread Krzysztof Oledzki

Hello,

I am experiencing weird system hangs. About once every 2-5 weeks the system
freezes and stops accepting remote connections, so it is no longer possible
to connect to the most important services: smtp (postfix), www (squid) or
even ssh. Such a connection is accepted, but then it hangs.


What is strange is that a previously established ssh session remains usable.
It is possible to work on such a system until you do something stupid like
"less /var/log/all.log". Using strace I found that the process blocks on:


--- strace: begin ---
execve("/usr/bin/tail", ["tail", "-f", "/var/log/all.log"], [/* 33 vars */]) = 0
brk(0) = 0x8052000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x6ff0
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=20944, ...}) = 0
mmap2(NULL, 20944, PROT_READ, MAP_PRIVATE, 3, 0) = 0x6fefa000
close(3) = 0
open("/lib/libc.so.6", O_RDONLY) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0RY\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1175920, ...}) = 0
mmap2(NULL, 1185212, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x6fdd8000
mmap2(0x6fef4000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x11b) = 0x6fef4000
mmap2(0x6fef7000, 9660, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x6fef7000
close(3) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x6fdd7000
set_thread_area({entry_number:-1 -> 6, base_addr:0x6fdd76b0, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0x6fef4000, 4096, PROT_READ) = 0
mprotect(0x6ff1c000, 4096, PROT_READ) = 0
munmap(0x6fefa000, 20944) = 0
brk(0) = 0x8052000
brk(0x8073000) = 0x8073000
open("/var/log/all.log", O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0640, st_size=3171841, ...})
llseek(3, 0, <unfinished ...>
--- strace: end ---
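
For longer captures, the blocked call can be picked out mechanically. strace marks a call that has not returned with "unfinished ..." and, when following several processes, reports its completion later with "resumed". The sketch below is only an illustration with a made-up function name; it matches pending calls by syscall name alone, which is an assumption that is too coarse for busy multi-process traces.

```python
import re

# A call that did not return, optionally prefixed by "[pid N]" or a bare pid.
# The angle brackets around "unfinished ..." may be missing in archived copies.
UNFINISHED = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*unfinished \.\.\."
)
# Completion marker, e.g. "<... read resumed> ...".
RESUMED = re.compile(r"\.\.\. (\w+) resumed")


def blocked_syscalls(text):
    """Return names of syscalls that were started but never resumed."""
    pending = []
    for line in text.splitlines():
        started = UNFINISHED.search(line)
        if started:
            pending.append(started.group(1))
            continue
        finished = RESUMED.search(line)
        if finished and finished.group(1) in pending:
            pending.remove(finished.group(1))
    return pending


# The last three lines of the strace output in this message.
SAMPLE = """\
open(/var/log/all.log, O_RDONLY|O_LARGEFILE) = 3
fstat64(3, {st_mode=S_IFREG|0640, st_size=3171841, ...})
llseek(3, 0,  unfinished ...
"""

if __name__ == "__main__":
    print(blocked_syscalls(SAMPLE))
```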

This file is not very big:

# ls -l /var/log/all.log
-rw-r----- 1 root root 3171841 Sep 27 04:36 /var/log/all.log

Also running "dmesg > file" hangs, creating a file with only 4096 bytes.

--- Show Blocked State: begin ---
SysRq : Show Blocked State

                         free                        sibling
  task             PC    stack   pid father child younger older
syslogd   D F5C83C60 0  2162  1 (NOTLB)
   f5c83c74 0082 0002 f5c83c60 f5c83c5c   78538d20
   0009 0001 f7f6a070 f7cb8030 82c47e5f 0001cfed 0a43 f7f6a17c
   7a016980 f705dc80 78404217 7812c708  0213 f5c83c84 1e7a64bb
Call Trace:
 [<78404217>] _spin_unlock_irqrestore+0xf/0x23
 [<7812c708>] __mod_timer+0x92/0x9c
 [<78402b34>] schedule_timeout+0x70/0x8d
 [<7812c521>] process_timeout+0x0/0x5
 [<78402548>] io_schedule_timeout+0x1e/0x28
 [<7814d41e>] congestion_wait+0x50/0x64
 [<78134abc>] autoremove_wake_function+0x0/0x35
 [<781493e7>] balance_dirty_pages_ratelimited_nr+0x16e/0x1dc
 [<78145bd0>] generic_file_buffered_write+0x4ee/0x605
 [<783c55a1>] unix_dgram_recvmsg+0x1b4/0x1c8
 [<78128c8e>] current_fs_time+0x41/0x46
 [<78146167>] __generic_file_aio_write_nolock+0x480/0x4df
 [<7814621b>] generic_file_aio_write+0x55/0xb3
 [<78194b28>] ext3_file_write+0x24/0x8f
 [<7815f34f>] do_sync_readv_writev+0xc1/0xfe
 [<78134abc>] autoremove_wake_function+0x0/0x35
 [<784041ae>] _spin_unlock+0xd/0x21
 [<781a8c38>] log_wait_commit+0xc3/0xe3
 [<7814448b>] find_get_pages_tag+0x76/0x80
 [<7815f204>] rw_copy_check_uvector+0x50/0xaa
 [<7815f9d4>] do_readv_writev+0x99/0x164
 [<78194b04>] ext3_file_write+0x0/0x8f
 [<7815fadc>] vfs_writev+0x3d/0x48
 [<7815feb5>] sys_writev+0x41/0x67
 [<78103d6a>] sysenter_past_esp+0x5f/0x85
 ===
freshclam D 0282 0  2866  1 (NOTLB)
   f36e3cc4 0082 0009 0282 7a0173c0 0002  007b
   0009 0001 f7cb8030 f7c72030 82c4884d 0001cfed 09ee f7cb813c
   7a016980 f66c0b80 78404217 7812c708  0213 f36e3cd4 1e7a64bb
Call Trace:
 [<78404217>] _spin_unlock_irqrestore+0xf/0x23
 [<7812c708>] __mod_timer+0x92/0x9c
 [<78402b34>] schedule_timeout+0x70/0x8d
 [<7812c521>] process_timeout+0x0/0x5
 [<78402548>] io_schedule_timeout+0x1e/0x28
 [<7814d41e>] congestion_wait+0x50/0x64
 [<78134abc>] autoremove_wake_function+0x0/0x35
 [<781493e7>] balance_dirty_pages_ratelimited_nr+0x16e/0x1dc
 [<78145bd0>] generic_file_buffered_write+0x4ee/0x605
 [<7819cdb4>] __ext3_journal_stop+0x19/0x34
 [<7840408f>] _spin_lock+0xd/0x5a
 [<78176f3d>] __mark_inode_dirty+0xdd/0x16f
 [<78128c8e>] current_fs_time+0x41/0x46
 [<78146167>] __generic_file_aio_write_nolock+0x480/0x4df
 [<7814621b>] generic_file_aio_write+0x55/0xb3
 [<78103159>] setup_sigcontext+0x105/0x189
 [<78194b28>] ext3_file_write+0x24/0x8f
 [<7815f453>] 

Re: Strange system hangs

2007-09-28 Thread Krzysztof Oledzki



On Fri, 28 Sep 2007, Peter Zijlstra wrote:


On Fri, 2007-09-28 at 10:42 +0200, Krzysztof Oledzki wrote:

Hello,

I am experiencing weird system hangs. About once every 2-5 weeks the system
freezes and stops accepting remote connections, so it is no longer possible
to connect to the most important services: smtp (postfix), www (squid) or
even ssh. Such a connection is accepted, but then it hangs.

What is strange is that a previously established ssh session remains usable.
It is possible to work on such a system until you do something stupid like
"less /var/log/all.log".


So it takes weeks to reproduce this?


Unfortunately, yes. :(


                         free                        sibling
  task             PC    stack   pid father child younger older
syslogd   D F5C83C60 0  2162  1 (NOTLB)
f5c83c74 0082 0002 f5c83c60 f5c83c5c   78538d20
0009 0001 f7f6a070 f7cb8030 82c47e5f 0001cfed 0a43 f7f6a17c
7a016980 f705dc80 78404217 7812c708  0213 f5c83c84 1e7a64bb
Call Trace:
  [<78404217>] _spin_unlock_irqrestore+0xf/0x23
  [<7812c708>] __mod_timer+0x92/0x9c
  [<78402b34>] schedule_timeout+0x70/0x8d
  [<7812c521>] process_timeout+0x0/0x5
  [<78402548>] io_schedule_timeout+0x1e/0x28
  [<7814d41e>] congestion_wait+0x50/0x64
  [<78134abc>] autoremove_wake_function+0x0/0x35
  [<781493e7>] balance_dirty_pages_ratelimited_nr+0x16e/0x1dc
  [<78145bd0>] generic_file_buffered_write+0x4ee/0x605
  [<783c55a1>] unix_dgram_recvmsg+0x1b4/0x1c8
  [<78128c8e>] current_fs_time+0x41/0x46
  [<78146167>] __generic_file_aio_write_nolock+0x480/0x4df
  [<7814621b>] generic_file_aio_write+0x55/0xb3
  [<78194b28>] ext3_file_write+0x24/0x8f
  [<7815f34f>] do_sync_readv_writev+0xc1/0xfe
  [<78134abc>] autoremove_wake_function+0x0/0x35
  [<784041ae>] _spin_unlock+0xd/0x21
  [<781a8c38>] log_wait_commit+0xc3/0xe3
  [<7814448b>] find_get_pages_tag+0x76/0x80
  [<7815f204>] rw_copy_check_uvector+0x50/0xaa
  [<7815f9d4>] do_readv_writev+0x99/0x164
  [<78194b04>] ext3_file_write+0x0/0x8f
  [<7815fadc>] vfs_writev+0x3d/0x48
  [<7815feb5>] sys_writev+0x41/0x67
  [<78103d6a>] sysenter_past_esp+0x5f/0x85
  ===


This trace puzzles me: what is unix_dgram_recvmsg doing there?
Also, it has two invocations of ext3_file_write. Do you have a stacked
filesystem of sorts, e.g. ext3 on loopback on ext3?


No, no loopback:

# mount
/dev/md0 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec)
devpts on /dev/pts type devpts (rw,nosuid,noexec)
/dev/mapper/VolGrp0-usr on /usr type ext3 (rw,nodev,data=journal)
/dev/mapper/VolGrp0-var on /var type ext3 (rw,nodev,data=journal)
/dev/mapper/VolGrp0-squid_spool on /var/cache/squid/cd0 type ext3 (rw,nosuid,nodev,noatime,data=writeback)
/dev/mapper/VolGrp0-squid_spool2 on /var/cache/squid/cd1 type ext3 (rw,nosuid,nodev,noatime,data=writeback)
/dev/mapper/VolGrp0-news_spool on /var/spool/news type ext3 (rw,nosuid,nodev,noatime)
shm on /dev/shm type tmpfs (rw,noexec,nosuid,nodev)
usbfs on /proc/bus/usb type usbfs (rw,noexec,nosuid,devmode=0664,devgid=85)
owl:/usr/gentoo-nfs on /usr/gentoo-nfs type nfs (ro,nosuid,nodev,noatime,bg,intr,tcp,addr=192.168.129.26)

Nothing more.
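
The same check can be done mechanically on larger mount tables. The sketch below parses "mount" output (one entry per line, so wrapped lines must be joined first), reports any loop-backed mounts and lists the data= mode of each ext3 filesystem. The function names are made up for illustration, and "unspecified" only means that no data= option is listed, not what the kernel actually uses.

```python
import re

# One entry of classic `mount` output: DEVICE on MOUNTPOINT type FSTYPE (OPTS)
MOUNT_LINE = re.compile(r"^(\S+) on (\S+) type (\S+) \(([^)]*)\)$")


def parse_mounts(text):
    """Parse `mount` output into a list of dicts, one per mounted filesystem."""
    mounts = []
    for line in text.splitlines():
        match = MOUNT_LINE.match(line.strip())
        if match:
            device, mountpoint, fstype, options = match.groups()
            mounts.append({
                "device": device,
                "mountpoint": mountpoint,
                "fstype": fstype,
                "options": options.split(","),
            })
    return mounts


def loop_mounts(mounts):
    """Mount points backed by a loop device or mounted with a loop option."""
    return [
        m["mountpoint"]
        for m in mounts
        if m["device"].startswith("/dev/loop")
        or any(o == "loop" or o.startswith("loop=") for o in m["options"])
    ]


def ext3_data_modes(mounts):
    """Map each ext3 mount point to its listed data= journalling mode."""
    modes = {}
    for m in mounts:
        if m["fstype"] != "ext3":
            continue
        mode = "unspecified"  # no data= option shown in the mount table
        for option in m["options"]:
            if option.startswith("data="):
                mode = option[len("data="):]
        modes[m["mountpoint"]] = mode
    return modes


# Four entries from the mount table above, with wrapped lines joined.
SAMPLE = """\
/dev/md0 on / type ext3 (rw)
/dev/mapper/VolGrp0-var on /var type ext3 (rw,nodev,data=journal)
/dev/mapper/VolGrp0-squid_spool on /var/cache/squid/cd0 type ext3 (rw,nosuid,nodev,noatime,data=writeback)
shm on /dev/shm type tmpfs (rw,noexec,nosuid,nodev)
"""

if __name__ == "__main__":
    mounts = parse_mounts(SAMPLE)
    print("loop mounts:", loop_mounts(mounts))
    print("ext3 data modes:", ext3_data_modes(mounts))
```

On this table it finds no loop mounts, and it shows that /var, where all.log lives, is mounted with data=journal.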


freshclam D 0282 0  2866  1 (NOTLB)
f36e3cc4 0082 0009 0282 7a0173c0 0002  007b
0009 0001 f7cb8030 f7c72030 82c4884d 0001cfed 09ee f7cb813c
7a016980 f66c0b80 78404217 7812c708  0213 f36e3cd4 1e7a64bb
Call Trace:
  [<78404217>] _spin_unlock_irqrestore+0xf/0x23
  [<7812c708>] __mod_timer+0x92/0x9c
  [<78402b34>] schedule_timeout+0x70/0x8d
  [<7812c521>] process_timeout+0x0/0x5
  [<78402548>] io_schedule_timeout+0x1e/0x28
  [<7814d41e>] congestion_wait+0x50/0x64
  [<78134abc>] autoremove_wake_function+0x0/0x35
  [<781493e7>] balance_dirty_pages_ratelimited_nr+0x16e/0x1dc
  [<78145bd0>] generic_file_buffered_write+0x4ee/0x605
  [<7819cdb4>] __ext3_journal_stop+0x19/0x34
  [<7840408f>] _spin_lock+0xd/0x5a
  [<78176f3d>] __mark_inode_dirty+0xdd/0x16f
  [<78128c8e>] current_fs_time+0x41/0x46
  [<78146167>] __generic_file_aio_write_nolock+0x480/0x4df
  [<7814621b>] generic_file_aio_write+0x55/0xb3
  [<78103159>] setup_sigcontext+0x105/0x189
  [<78194b28>] ext3_file_write+0x24/0x8f
  [<7815f453>] do_sync_write+0xc7/0x10a
  [<78134abc>] autoremove_wake_function+0x0/0x35
  [<781085d2>] convert_fxsr_from_user+0x15/0xd5
  [<7815f38c>] do_sync_write+0x0/0x10a
  [<7815fbb6>] vfs_write+0x8a/0x10c
  [<78160123>] sys_write+0x41/0x67
  [<78103d6a>] sysenter_past_esp+0x5f/0x85
  ===


A single write, no networking, and it is also stuck in balance_dirty_pages().


Exactly. Strange, isn't it?

Thanks.

Best regards,

Krzysztof Olędzki