I'm kind of buried in work right now, but your problems appear serious, so I'm trying to respond as quickly as I can.

On Wed, 2005-09-14 at 23:41 +0100, Peter Grandi wrote:
> The bad news is that I have already suffered from several
> crashes and one bizarre performance problem... My setup
> consists of an Athlon Xp 2000+, 512MB, 2x80GB and 2x160GB hard
> discs, running a mainline 2.6.13 kernel, with 1.1.18 'jfsprogs'.
>
> The incidents so far:
>
> * Some of my tests were tree traversals, that generate a flood of
>   inode updates because, which hit the journal hard. So I wondered
>   what would the timings be with '-o noatime', unfortunately I
>   got a crash because of that.

noatime is used a lot. I don't think it was a direct cause of the
problem, but it might have affected memory reuse (as inodes should be
easier to reclaim if they aren't being marked dirty). More on the crash
below.

> * When converting from 'ext3' to JFS file systems, I did this by
>   copying things around, and I got a couple of lockups. It may
>   be that these were related to high buffer cache traffic (I was
>   doing a large 'dd' between partitions at one time) and races
>   thereof.

No idea what could be happening here. If you could capture a stack
trace of the processes, it may give me a clue what's going on.
'echo t > /proc/sysrq-trigger' should dump the stack traces to the
syslog.

> * When restoring a '.tar.bz2' held on a 'vfat' file system to a
>   newly formatted 'jfs' one I got a dtree corruption, with no
>   device errors. I 'fsck'ed it to fix that and redid the restore
>   and it did not happen again. There was again a 'dd' between
>   two partitions running at the same time.

I had another recent report of dtree corruption whose cause I wasn't
sure of. I suspected it might be related to a case-insensitivity bug,
but that couldn't be the case here. (I highly doubt that you ran
mkfs -O to create the partition.)

> * Making a file system with a 30MiB log instead of the default
>   32MiB makes reading it with 'tar' over twice as slow. This for
>   the same partition on the same hard disc with the same content
>   freshly loaded (it was so strange I checked several times).

You're right that this is strange. If you are running with noatime,
the journal shouldn't be a factor at all when reading the volume. This
one really puzzles me.

> All which leads to think that not many people have used non
> default log sizes, or used JFS with FAT32 or massive 'dd'ing, or
> with 'noatime'... :-)

Again, I think that noatime is pretty common. I use it.

> Some more context and some data... I was in multiuser but not
> GUI mode when the incidents above happened, with only a few
> dæmons running.
>
> The output of 'jfs_fsck' after the «DT_GETPAGE: dtree page
> corrupt» errors:

If you see this again, run jfs_fsck with the -v flag. That may give me
a better idea of the nature of the dtree corruption.

>
> The one ''oops'' that got logged (it happened twice):
>
> ----------------------------------------------------------------
> Unable to handle kernel paging request at virtual address cc05b9a4
>  printing eip:
> c0251f5d
> *pde = 00030067
> *pte = 0c05b000
> Oops: 0000 [#1]
> DEBUG_PAGEALLOC
> Modules linked in: binfmt_misc snd_cmipci snd_opl3_lib snd_hwdep snd_seq_oss
> snd_seq_midi snd_seq_midi_event snd_seq snd_via82xx gameport snd_ac97_codec
> snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_mpu401_uart
> snd_rawmidi snd_seq_device snd soundcore 3c59x mii parport_pc lp parport
> video thermal processor fan container button battery ac it87 eeprom
> i2c_sensor i2c_isa i2c_dev i2c_core ntfs nls_iso8859_1 nls_cp437 sg sr_mod
> ide_scsi scsi_mod 8250 serial_core nvram rtc
> CPU:    0
> EIP:    0060:[txUpdateMap+333/656]    Not tainted VLI
> EFLAGS: 00010246   (2.6.13p)
> EIP is at txUpdateMap+0x14d/0x290
> eax: cc05b97c   ebx: e0996990   ecx: e08366c8   edx: 00000900
> esi: 00000001   edi: e0996980   ebp: dfdc7f48   esp: dfdc7f10
> ds: 007b   es: 007b   ss: 0068
> Process jfsCommit (pid: 139, threadinfo=dfdc7000 task=c15725d0)
> Stack: e084be30 0000060c dfdc7f48 c024f181 00000000 00000040 d94596fc dbefc2fc
>        00000202 00000000 00000000 dc64d160 e08366c8 e08366c8 dfdc7f74 c02529b2
>        e08366c8 00000286 e0861514 dfdc7fe4 00000000 0000007b 0000007b dc64d160
> Call Trace:
>  [show_stack+127/160] show_stack+0x7f/0xa0
>  [show_registers+343/448] show_registers+0x157/0x1c0
>  [die+332/688] die+0x14c/0x2b0
>  [do_page_fault+921/1791] do_page_fault+0x399/0x6ff
>  [error_code+79/84] error_code+0x4f/0x54
>  [txLazyCommit+34/688] txLazyCommit+0x22/0x2b0
>  [jfs_lazycommit+844/1200] jfs_lazycommit+0x34c/0x4b0
>  [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10
> Code: f6 47 04 02 0f 85 4f 01 00 00 8d 5f 10 0f b6 43 03 85 c0 74 4d 89 c6 8d
> b4 26 00 00 00 00 f6 43 04 f0 0f 85 16 01 00 00 8b 47 0c <0f> b7 40 28 25 00
> f0 00 00 3d 00 40 00 00 0f 84 ef 00 00 00 8b

This was enough to tell me what's going on. txUpdateMap should not be
accessing tlck->ip, since it may no longer be valid. I think
DEBUG_PAGEALLOC helped uncover this bug. It shouldn't be too hard to
fix, but it isn't trivial either. I'll try to get a patch to you soon.

> Sorry for the relative lack of details, I hope that there is
> enough to start an investigation.

Thanks, this was helpful. It was enough to discover one bug, and
hopefully we can make some progress on the other ones.

Thanks,
Shaggy
--
David Kleikamp
IBM Linux Technology Center

_______________________________________________
Jfs-discussion mailing list
https://lists.sourceforge.net/lists/listinfo/jfs-discussion
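
P.S. For the lockups, the stack-trace capture suggested above could be wrapped in a small script along these lines. This is only a rough sketch: the function names and the default output file name are made up for illustration, and it assumes the kernel was built with magic SysRq support and that the script is run as root while the hang is in progress.

```shell
#!/bin/sh
# Sketch only: ask the kernel to log every task's stack via SysRq 't',
# then save a copy of the kernel log. Function names and the default
# output path are hypothetical; both paths can be overridden.
TRIGGER="${TRIGGER:-/proc/sysrq-trigger}"
OUT="${OUT:-/tmp/jfs-hang-stacks.txt}"

trigger_task_dump() {
    # Writing 't' to the trigger file requests a dump of all task states.
    if [ ! -w "$TRIGGER" ]; then
        echo "cannot write $TRIGGER (need root and magic SysRq support)" >&2
        return 1
    fi
    echo t > "$TRIGGER"
}

save_kernel_log() {
    # Give the kernel a moment to finish printing, then copy the ring buffer.
    sleep 1
    dmesg > "$OUT"
}

# Typical use while the copy is hung:
#   trigger_task_dump && save_kernel_log
```

If the ring buffer is too small to hold the whole dump, the copy in the syslog files is the one to send.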
