Hello,
please try the attached patch.
On Wednesday 15 March 2006 10:50, Christian Trefzer wrote:
> Hi everyone,
>
> I got this half an hour ago, with some processes left in D state,
> namely ooffice.bin and two instances of procmail, as this happened on
> my /home LV:
>
> kernel BUG at
> /usr/src/sources/linux-2.6.16-rc5/fs/reiser4/plugin/file/tail_conversion.c:29!
> invalid opcode: 0000 [#1]
> PREEMPT
> Modules linked in: mga drm w83781d hwmon_vid hwmon i2c_isa
> snd_seq_midi snd_pcm_oss snd_mixer_oss snd_seq_oss snd_seq_midi_event
> snd_seq snd_cmipci snd_opl3_lib snd_hwdep snd_mpu401_uart ohci_hcd
> floppy sr_mod cdrom pata_via i2c_viapro aic7xxx scsi_transport_spi
> ehci_hcd uhci_hcd 3c59x mii snd_ens1370 gameport snd_rawmidi
> snd_seq_device snd_pcm snd_timer snd_ak4531_codec snd soundcore
> snd_page_alloc via_agp agpgart usbcore xfs exportfs reiser4 ext2 loop
> lp parport_pc parport rtc psmouse reiserfs dm_mod raid5 raid1 xor
> md_mod pata_pdc2027x libata sd_mod scsi_mod unix
> CPU: 0
> EIP: 0060:[<f2daa02d>] Not tainted VLI
> EFLAGS: 00010286 (2.6.16-rc5 #10)
> EIP is at get_exclusive_access+0x31/0x44 [reiser4]
> eax: b26d6c04 ebx: 00000000 ecx: ec54bbf4 edx: b736fdc0
> esi: 3dbf3000 edi: 00006c85 ebp: 00006c85 esp: ded36f0c
> ds: 007b es: 007b ss: 0068
> Process soffice.bin (pid: 12533, threadinfo=ded36000 task=b3b7b070)
> Stack: <0>f2da83da 00000000 c52ca544 e52935a8 00007000 b014c75f
> c52ca544 e7b3cc80 b0151cd1 b6b54354 b6b5434c 3dbf3000 ed113360
> e6a02160 b26d6bc0 ec54bc4c ec54bbf4 00000000 00006c85 00000001
> 00000000 ec54bc00 00000000 00006c85
> Call Trace:
> [<f2da83da>] write_unix_file+0x1ba/0x60c [reiser4]
> [<f2da8220>] write_unix_file+0x0/0x60c [reiser4]
> [<b0101135>] syscall_call+0x7/0xb
> Code: ff 21 e0 8b 00 8b 80 b0 04 00 00 8b 40 40 8b 50 08 85 d2 75 16
> ba 01 00 ff ff 89 c8 0f c1 10 85 d2 75 12 c7 41 24 01 00 00 00 c3
> <0f> 0b 1d 00 04 8c dc f2 eb e0 51 e8 13 ac 35 bd 59 eb e5 55 89
>
>
> I had another occurrence of something looking similar at first
> glance, repeatedly grinding my laptop to a halt when I was on a trip.
> The only way to make it go away was to wipe the device by dd'ing
> /dev/zero to it. Not even tar-backup and mkfs did the job - otherwise
> I could have left out the word "repeatedly"...
>
> The only thing I could imagine other than a serious problem wrt.
> reiser4 code is a "soft" bad block relocated by the drive upon write,
> but there was nothing like a read error in the logs. Furthermore I
> wanted the gurus to know since it occurred to me more than once.
>
>
> Thanks for your time!
> Chris
>
>
>
> FYI, here comes something about the disk, including SMART error log:
>
>
> /dev/sda:
>
> ATA device, with non-removable media
> Model Number: SAMSUNG SV1203N
> Serial Number: S01CJ10Y410901
> Firmware Revision: TQ100-30
> Standards:
> Supported: 7 6 5 4
> Likely used: 7
> Configuration:
> Logical max current
> cylinders 16383 16383
> heads 16 16
> sectors/track 63 63
> --
> CHS current addressable sectors: 16514064
> LBA user addressable sectors: 234493056
> LBA48 user addressable sectors: 234493056
> device size with M = 1024*1024: 114498 MBytes
> device size with M = 1000*1000: 120060 MBytes (120 GB)
> Capabilities:
> LBA, IORDY(can be disabled)
> Queue depth: 1
> Standby timer values: spec'd by Standard, no device specific minimum
> R/W multiple sector transfer: Max = 16 Current = 16
> Recommended acoustic management value: 254, current value: 254
> DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
> Cycle time: min=120ns recommended=120ns
> PIO: pio0 pio1 pio2 pio3 pio4
> Cycle time: no flow control=240ns IORDY flow control=120ns
> Commands/features:
> Enabled Supported:
> * READ BUFFER cmd
> * WRITE BUFFER cmd
> * Host Protected Area feature set
> * Look-ahead
> * Write cache
> * Power Management feature set
> Security Mode feature set
> * SMART feature set
> * FLUSH CACHE EXT command
> * Mandatory FLUSH CACHE command
> * Device Configuration Overlay feature set
> * 48-bit Address feature set
> * Automatic Acoustic Management feature set
> SET MAX security extension
> * DOWNLOAD MICROCODE cmd
> * SMART self-test
> * SMART error logging
> Security:
> Master password revision code = 65534
> supported
> not enabled
> not locked
> not frozen
> not expired: security count
> supported: enhanced erase
> 56min for SECURITY ERASE UNIT. 56min for ENHANCED SECURITY ERASE UNIT.
> HW reset results:
> CBLID- above Vih
> Device num = 0 determined by the jumper
> Checksum: correct
>
>
>
> smartctl version 5.33 [i386-pc-linux-gnu] Copyright (C) 2002-4 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF READ SMART DATA SECTION ===
> SMART Error Log Version: 1
> ATA Error Count: 8 (device log contains only the most recent five errors)
> CR = Command Register [HEX]
> FR = Features Register [HEX]
> SC = Sector Count Register [HEX]
> SN = Sector Number Register [HEX]
> CL = Cylinder Low Register [HEX]
> CH = Cylinder High Register [HEX]
> DH = Device/Head Register [HEX]
> DC = Device Command Register [HEX]
> ER = Error register [HEX]
> ST = Status register [HEX]
> Powered_Up_Time is measured from power on, and printed as
> DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
> SS=sec, and sss=millisec. It "wraps" after 49.710 days.
>
> Error 8 occurred at disk power-on lifetime: 1595 hours (66 days + 11 hours)
> When the command that caused the error occurred, the device was active or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 04 51 00 00 00 00 a0 Error: ABRT
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
> -- -- -- -- -- -- -- -- ---------------- --------------------
> b1 c0 00 00 00 00 a0 00 16:23:07.813 DEVICE CONFIGURATION RESTORE
> b1 c2 00 00 00 00 a0 00 16:23:07.813 DEVICE CONFIGURATION IDENTIFY
> 9a 23 04 00 02 00 a0 00 16:23:07.813 [VENDOR SPECIFIC]
> 9a 23 04 00 02 00 a0 00 16:23:07.750 [VENDOR SPECIFIC]
> 9a 23 01 00 02 00 a0 00 16:23:07.750 [VENDOR SPECIFIC]
>
> Error 7 occurred at disk power-on lifetime: 1595 hours (66 days + 11 hours)
> When the command that caused the error occurred, the device was active or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 02 51 3f 00 00 00 e0 Error: TK0NF
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
> -- -- -- -- -- -- -- -- ---------------- --------------------
> 10 00 3f 00 00 00 e0 00 16:10:46.563 RECALIBRATE [OBS-4]
> 91 00 3f 3f ff 3f e0 00 16:10:46.563 INITIALIZE DEVICE PARAMETERS [OBS-6]
> ef 03 45 01 00 00 a0 00 16:10:46.563 SET FEATURES [Set transfer mode]
> ef 03 0c 01 00 00 a0 00 16:10:46.563 SET FEATURES [Set transfer mode]
> ec 00 00 01 00 00 a0 00 16:10:45.813 IDENTIFY DEVICE
>
> Error 6 occurred at disk power-on lifetime: 1595 hours (66 days + 11 hours)
> When the command that caused the error occurred, the device was active or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 02 51 00 00 00 00 e0 Error: TK0NF
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
> -- -- -- -- -- -- -- -- ---------------- --------------------
> 10 00 00 00 00 00 e0 00 16:10:32.625 RECALIBRATE [OBS-4]
> 00 00 01 01 00 00 a0 00 16:10:32.625 NOP [Abort queued commands]
>
> Error 5 occurred at disk power-on lifetime: 1595 hours (66 days + 11 hours)
> When the command that caused the error occurred, the device was active or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 02 51 00 00 00 00 e0 Error: TK0NF
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
> -- -- -- -- -- -- -- -- ---------------- --------------------
> 10 00 00 00 00 00 e0 00 16:10:32.063 RECALIBRATE [OBS-4]
> 00 da 01 01 00 00 a0 00 16:10:32.063 NOP [Reserved subcommand]
> b0 da 10 01 4f c2 a0 00 16:10:25.688 SMART RETURN STATUS
> b0 d8 10 01 4f c2 a0 00 16:10:25.625 SMART ENABLE OPERATIONS
> c6 03 10 01 00 00 a0 00 16:10:25.625 SET MULTIPLE MODE
>
> Error 4 occurred at disk power-on lifetime: 483 hours (20 days + 3 hours)
> When the command that caused the error occurred, the device was active or idle.
>
> After command completion occurred, registers were:
> ER ST SC SN CL CH DH
> -- -- -- -- -- -- --
> 04 51 00 00 4f c2 e0 Error: ABRT
>
> Commands leading to the command that caused the error were:
> CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
> -- -- -- -- -- -- -- -- ---------------- --------------------
> b0 da 00 00 4f c2 e0 00 21d+14:59:16.250 SMART RETURN STATUS
> ec 00 00 ac 87 e4 e0 00 21d+14:59:16.188 IDENTIFY DEVICE
> ef 82 00 00 00 00 e0 00 21d+14:47:39.438 SET FEATURES [Disable write cache]
> ef 02 00 00 00 00 e0 00 21d+14:44:35.625 SET FEATURES [Enable write cache]
> ef 42 fe 00 00 00 e0 00 21d+09:37:08.125 SET FEATURES [Enable AAM]
--
Alex.
From: Alexander Zarochentsev <[EMAIL PROTECTED]>
Have get_exclusive_access() restart the transaction before taking the r/w semaphore.
There are several places in write_unix_file() and extent_balance_dirty_pages()
where a transaction may still be open when get_exclusive_access() is called. This
triggers the "deadlock detection" BUG_ON inside get_exclusive_access().
This patch fixes the bug by embedding txn_restart_current() into the
get_exclusive_access() code and cleans up other places where
txn_restart_current() was called right before get_exclusive_access().
Signed-off-by: [EMAIL PROTECTED]
fs/reiser4/plugin/file/file.c | 13 -------------
fs/reiser4/plugin/file/tail_conversion.c | 5 ++---
2 files changed, 2 insertions(+), 16 deletions(-)
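For readers who do not have the reiser4 sources at hand, the change described
above can be modeled in user space roughly as follows. This is only a sketch
under stated assumptions, not the kernel code: struct txn_handle,
struct file_info, txn_restart() and the two get_exclusive_access_*() variants
are hypothetical stand-ins, and the old variant returns -1 at the point where
the real function would hit BUG_ON().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for reiser4's transaction handle and file info. */
struct txn_atom { int id; };

struct txn_handle {
	struct txn_atom *atom;  /* non-NULL while a transaction is open */
};

struct file_info {
	bool latch_held;        /* models uf_info->latch taken for write */
	int exclusive_use;
};

/* Model of txn_restart_current(): close the open transaction, if any. */
void txn_restart(struct txn_handle *txn)
{
	txn->atom = NULL;
}

/*
 * Old behaviour: every caller had to close its transaction beforehand.
 * Returns -1 where the real code would trigger BUG_ON().
 */
int get_exclusive_access_old(struct txn_handle *txn, struct file_info *uf)
{
	if (txn->atom != NULL)
		return -1;
	uf->latch_held = true;
	uf->exclusive_use = 1;
	return 0;
}

/*
 * New behaviour: the restart happens inside the function, so no caller
 * can reach the semaphore with an open atom.
 */
int get_exclusive_access_new(struct txn_handle *txn, struct file_info *uf)
{
	txn_restart(txn);
	assert(txn->atom == NULL);  /* holds by construction after the restart */
	uf->latch_held = true;
	uf->exclusive_use = 1;
	return 0;
}
```

Called with an open atom, get_exclusive_access_old() fails, while
get_exclusive_access_new() closes the transaction first and then takes the
latch. That is why the callers touched in file.c no longer need their own
txn_restart_current() call.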
Index: linux-2.6.16-rc4-mm2/fs/reiser4/plugin/file/file.c
===================================================================
--- linux-2.6.16-rc4-mm2.orig/fs/reiser4/plugin/file/file.c
+++ linux-2.6.16-rc4-mm2/fs/reiser4/plugin/file/file.c
@@ -1451,9 +1451,6 @@ static int commit_file_atoms(struct inod
int result;
unix_file_info_t *uf_info;
- /* close current transaction */
- txn_restart_current();
-
uf_info = unix_file_inode_data(inode);
/*
@@ -2174,7 +2171,6 @@ append_and_or_overwrite(hint_t * hint, s
done_lh(&hint->lh);
if (!exclusive) {
drop_nonexclusive_access(uf_info);
- txn_restart_current();
get_exclusive_access(uf_info);
}
result = tail2extent(uf_info);
@@ -2964,15 +2960,6 @@ int delete_object_unix_file(struct inode
unix_file_info_t *uf_info;
int result;
- /*
- * transaction can be open already. For example:
- * writeback_inodes->sync_sb_inodes->reiser4_sync_inodes->
- * generic_sync_sb_inodes->iput->generic_drop_inode->
- * generic_delete_inode->reiser4_delete_inode->delete_object_unix_file.
- * So, restart transaction to avoid deadlock with file rw semaphore.
- */
- txn_restart_current();
-
if (inode_get_flag(inode, REISER4_NO_SD))
return 0;
Index: linux-2.6.16-rc4-mm2/fs/reiser4/plugin/file/tail_conversion.c
===================================================================
--- linux-2.6.16-rc4-mm2.orig/fs/reiser4/plugin/file/tail_conversion.c
+++ linux-2.6.16-rc4-mm2/fs/reiser4/plugin/file/tail_conversion.c
@@ -20,13 +20,12 @@ void get_exclusive_access(unix_file_info
assert("nikita-3047", LOCK_CNT_NIL(inode_sem_w));
assert("nikita-3048", LOCK_CNT_NIL(inode_sem_r));
/*
- * "deadlock detection": sometimes we commit a transaction under
+ * "deadlock avoidance": sometimes we commit a transaction under
* rw-semaphore on a file. Such commit can deadlock with another
* thread that captured some block (hence preventing atom from being
* committed) and waits on rw-semaphore.
*/
- assert("nikita-3361", get_current_context()->trans->atom == NULL);
- BUG_ON(get_current_context()->trans->atom != NULL);
+ txn_restart_current();
LOCK_CNT_INC(inode_sem_w);
down_write(&uf_info->latch);
uf_info->exclusive_use = 1;