"Vonlanthen, Elmar":
> > > I will check the memory as well.
>
> The memory check didn't report any error, but I replaced the whole
> memory anyway.
Oh, that is not what I meant.
I was talking about "some other SOFTWARE component MAY overwrite
something in memory". I never meant a HARDWARE error.
If this misunderstanding made you buy new memory (hardware), I am
sorry.
> The kernel or any other process don't log any other problem. The only
> thing I see is "mount" which remains blocked.
Ok, understood.
> I can reproduce it again and again on two machines and even with
> replaced RAM. A new finding is that sometimes the command
> unblocks again (after a few minutes). But not every time.
I have run your "while" loop test for about an hour, but found
nothing wrong. But my system runs linux-3.3-rcN instead of 3.2.1.
> Call Trace:
> [<c102862b>] ? try_to_wake_up+0x17b/0x1f0
> [<c1020dc0>] ? __wake_up_common+0x40/0x70
> [<c10ac1ef>] ? iput+0x2f/0x1a0
> [<c134b630>] schedule+0x30/0x50
> [<f80bbdfd>] au_hn_alloc+0x24d/0x370 [aufs]
> [<c10431a0>] ? wake_up_bit+0x60/0x60
> [<f80bbb8c>] au_hn_free+0x1c/0x40 [aufs]
> [<f80b36cb>] au_hiput+0xb/0x20 [aufs]
> [<f80b380b>] au_iinfo_fin+0x12b/0x1a0 [aufs]
> [<f80a13ab>] au_si_free+0xabb/0xc00 [aufs]
> [<c10ac04c>] destroy_inode+0x2c/0x50
> [<c10ac143>] evict+0xd3/0x150
> [<c10ac27d>] iput+0xbd/0x1a0
> [<c10aa1ef>] d_kill+0x9f/0xf0
> [<c10aa3f0>] shrink_dentry_list+0x1b0/0x1d0
> [<c10aa8ae>] shrink_dcache_sb+0x5e/0x90
> [<c1098b25>] do_remount_sb+0x35/0x160
> [<c1034110>] ? ns_capable+0x20/0x50
> [<c10b1870>] do_mount+0x500/0x6e0
> [<c101b840>] ? mm_fault_error+0x130/0x130
> [<c10b0208>] ? copy_mount_options+0x98/0x110
> [<c10b1ab6>] sys_mount+0x66/0xa0
> [<c134d065>] syscall_call+0x7/0xb
This call trace may be unreliable too.
It shows destroy_inode() calling au_si_free(), and au_hn_free()
calling au_hn_alloc(), but there are no such calls in the source files.
Look at the VFS function destroy_inode() in linux/fs/inode.c, and you
will see what I mean.
> You say that au_hn_alloc() cannot follow au_hn_free(). But how can it
> be that I can reproduce it again and again? Which code is executing the
> function "wake_up_bit()"?
It is prefixed by '?' in the call trace which means that the address is
unreliable. In other words, if the function name is not prefixed by '?',
it is reliable.
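
As a rough illustration of the '?' rule, here is a minimal C sketch that
keeps only the frames whose symbol is not prefixed by '?'. The names
frame_is_reliable() and filter_reliable() are made up for this example.
It only parses text in the format of the trace quoted above and says
nothing about how the kernel itself decides which frames to mark.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if a call-trace line is a frame whose symbol is NOT
 * prefixed by '?'. Return 0 for '?' frames and for lines that are
 * not frames at all (e.g. the "Call Trace:" header).
 */
int frame_is_reliable(const char *line)
{
	const char *p = strstr(line, "[<");	/* start of the address field */

	if (p == NULL)
		return 0;
	p = strstr(p, ">]");			/* end of the address field */
	if (p == NULL)
		return 0;
	p += 2;
	while (*p == ' ' || *p == '\t')
		p++;
	return *p != '\0' && *p != '\n' && *p != '?';
}

/* Copy the unmarked frames from 'in' to 'out'; return how many were kept. */
int filter_reliable(FILE *in, FILE *out)
{
	char line[512];
	int kept = 0;

	while (fgets(line, sizeof(line), in) != NULL) {
		if (frame_is_reliable(line)) {
			fputs(line, out);
			kept++;
		}
	}
	return kept;
}
```

Calling filter_reliable(stdin, stdout) from a small main() and feeding it
the saved trace would leave only the unmarked frames. Mail quoting
prefixes such as "> " are skipped because the search starts at "[<".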
> Is there anything else I can try?
- MagicSysRq + A
or
- set the aufs module parameter 'debug' to 1, just before the mount that hangs
But I am afraid they may not help, since your call trace looks
unbelievable to me.
Another option is modifying aufs-util/mount.aufs.c.
Around line 223, you will see

	flags[AuFlush] = test_flush(opts);
	if (flags[AuFlush] /* && !flags[Fake] */) {
		err = au_plink(cwd, AuPlink_FLUSH,
			       AuPlinkFlag_OPEN | AuPlinkFlag_CLOEXEC,
			       &fd);
		if (err)
			AuFin(NULL);
	}

In your case, the function au_plink() is not called. But calling it may
break the current situation. So I'd suggest setting flags[AuFlush] to
1 regardless of the return value from test_flush().
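
The suggested experiment could look roughly like the sketch below. To
keep it compilable on its own, the enums and the test_flush() and
au_plink() bodies here are hypothetical stubs, NOT the real aufs-util
definitions. The wrapper name force_flush() and the plink_calls counter
are also made up, and the sketch returns the error instead of calling
AuFin(). In the real mount.aufs.c, only the assignment to
flags[AuFlush] would change.

```c
/* Hypothetical stand-ins; the real definitions live in aufs-util. */
enum { AuFlush, AuFlag_Last };
enum { AuPlink_FLUSH = 1 };
enum { AuPlinkFlag_OPEN = 1, AuPlinkFlag_CLOEXEC = 2 };

int plink_calls;	/* counts calls to the stub below */

/* Stub: pretend the mount options did not ask for a flush. */
int test_flush(const char *opts)
{
	(void)opts;
	return 0;
}

/* Stub: only records that it was called. */
int au_plink(const char *cwd, int cmd, unsigned int flags, int *fd)
{
	(void)cwd;
	(void)cmd;
	(void)flags;
	*fd = -1;
	plink_calls++;
	return 0;
}

/* The experiment: take the flush path whatever test_flush() says. */
int force_flush(const char *cwd, const char *opts)
{
	int flags[AuFlag_Last] = { 0 };
	int fd, err = 0;

	(void)test_flush(opts);	/* return value deliberately ignored */
	flags[AuFlush] = 1;	/* was: flags[AuFlush] = test_flush(opts); */
	if (flags[AuFlush] /* && !flags[Fake] */)
		err = au_plink(cwd, AuPlink_FLUSH,
			       AuPlinkFlag_OPEN | AuPlinkFlag_CLOEXEC,
			       &fd);
	return err;
}
```

With the stubs, au_plink() is reached even though test_flush() returns
0, which is the behaviour the modified mount.aufs would have.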
J. R. Okajima