On May 16, 2008  12:45 +0200, Patrick Winnertz wrote:
> As I wrote in #11742 [1] I experienced a kernel panic after doing heavy I/O
> on the 1.6.5rc2 cluster on the mds. Since nobody answered to this bug
> until now (and I think in other cases the lustre team is _really_ fast
> (thanks for that :))) I fear that it was not recognised by anybody.
>
> This kernel-panic seems somehow to be related to the bug mentioned above
> (#11742) as this bugnr. is mentioned in the dmesg output when it died.
> Furthermore right before it started to fail there were several messages
> like the following:
>
> LustreError: 3342:0:(osc_request.c:678:osc_announce_cached()) dirty
> 81108992 > dirty_max 33554432
>
> This behaviour is described in #13344 [2].
Sorry, I don't have net access right now, so I can't see your comments in
the bug, but the above message is definitely unusual and an indication of
some kind of code bug.

The client imposes a limit on the amount of dirty data that it can cache
(in /proc/fs/lustre/osc/*/max_dirty_mb, default 32MB), on a per-OST basis.
This ensures that in case of lock cancellation there isn't, say, 5TB of
dirty data on the client that would take 30 minutes to flush to the OST.
It seems that either the accounting of the number of dirty pages on the
client has gone badly wrong, or the client has actually dirtied far more
data (80MB) than it should have been able to (32MB).

Could you please explain the type of IO that the client is doing?  Is this
a normal write(), or writev(), pwrite(), O_DIRECT, mmap, or something
else?  Were there IO errors, IO resends, or some other unusual problem?
The entry points for these kinds of IO into Lustre are all slightly
different, and it wouldn't be the first time there was an accounting error
somewhere.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss