bug in 2.0.36/38 scsi cdrom driver

Alexander Bluhm Sun, 31 Oct 1999 09:46:29 -0800
There is a bug in the Linux 2.0.36 SCSI CD-ROM driver.
2.0.38 has the same bug, and the suspicious code in 2.2.13 looks similar,
so I think it has not been fixed yet.
This bug report refers to 2.0.36.

I have a data CD-R that my CD-ROM drive can't read; the error correction
fails.
I can mount this CD-R because the inner tracks are OK.
But when I read the files in certain directories, the kernel prints
a lot of error messages and sometimes a register/stack dump.
When this happens, the system hangs and nothing works.
The error is quite reproducible; it is not possible to read
the whole CD-R without a system crash.

I compiled my kernel with the -g -DDEBUG options to see what happens.
Here is the kernel output on the console:


... a lot of scsi error messages
command : 08  04  b8  96  0e  00  00  00  14  42
internal_cmnd (host = 0, channel = 0, target = 5, command = 00e84234, buffer = 0001d000,
bufflen = 28627, done = 0481229c)
queuecommand : routine at 00188608)
leaving internal_cmnd()
Leaving scsi_do_cmd()
In scsi_done(host = 0, result = 000002)
Calling done function - at address 0481229c
sr.c done: 28000002 225a400
scsi_free 0001d200 1024
scsi_free 0001d000 512
(2281998 36 0) stack segment: 0000
CPU:    0
EIP:    0010:[<04812616>]
EFLAGS: 00010286
eax: 00000000   ebx: 00000000   ecx: 0000232c   edx: 00000000
esi: 00e84214   edi: 00000000   ebp: c800130c   esp: 001ab0c0
ds: 0018   es: 0018   fs: 002b   gs: 0000   ss: 0018
Process swapper (pid: 0, process nr: 0, stackpage=001a9254)
Stack: 00000a93 00e84214 00000000 00000028 0009712e 0009e2b8 0001d200 0000001c
       0001d000 00000a93 00000004 00000000 00000038 28000002 0017810c 00000000
       002b8024 002b8068 00000246 00000246 001ab17c 002b8068 00000000 002b8068
Call Trace: [<0017810c>] [<0017c4b6>] [<00183690>] [<0010cd45>] [<0010cb53>] [<001623b6>] [<00109ab5>]
       [<00109b23>] [<0010aae5>] [<0010971c>] [<001094d8>] [<00117df8>] [<00111ff8>]
Code: 83 7c 15 04 00 74 12 8b 44 15 08 50 8b 44 15 00 50 e8 50 63
Aiee, killing interrupt handler
kfree of non-kmalloced memory: 001ab29c, next= 00000100, order=1749660
kfree of non-kmalloced memory: 001ab28c, next= 001ab29c, order=1749660
kfree of non-kmalloced memory: 001ab7a0, next= 001ab28c, order=1749660
idle task may not sleep
idle task may not sleep
idle task may not sleep
idle task may not sleep
idle task may not sleep


That's what ksymoops says about the dump:

Using `System.map' to map addresses to symbols.

Trace: 17810c <scsi_done+7e4/7f0>
Trace: 17c4b6 <aic7xxx_done_cmds_complete+3a/4c>
Trace: 183690 <do_aic7xxx_isr+5c/74>
Trace: 10cd45 <do_IRQ+65/88>
Trace: 10cb53 <IRQ15_interrupt+5f/84>
Trace: 1623b6 <apm_do_idle+42/80>
Trace: 109ab5 <hard_idle+29/5c>
Trace: 109b23 <sys_idle+3b/70>
Trace: 10aae5 <system_call+55/7c>
Trace: 10971c <init>
Trace: 1094d8 <start_kernel+1d4/1e0>
Trace: 117df8 <it_real_fn>
Trace: 111ff8 <schedule+234/28c>
Code: 
Code:  83 7c 15 04 00   cmpl   $0x0,0x4(%ebp,%edx,1)
Code:  74 12            je     19 <_EIP+0x19>
Code:  8b 44 15 08      movl   0x8(%ebp,%edx,1),%eax
Code:  50               pushl  %eax
Code:  8b 44 15 00      movl   0x0(%ebp,%edx,1),%eax
Code:  50               pushl  %eax
Code:  e8 50 63 00 90   call   90006366 <_EIP+0x90006366>
Code:  90               nop    
Code:  90               nop    


The error happens in the sr_mod.o module.
I disassembled it with objdump -S and found this section:


        for(i=0; i<SCpnt->use_sg; i++) {
     5ff:       31 db           xorl   %ebx,%ebx
     601:       66 85 c0        testw  %ax,%ax
     604:       74 32           je     638 <rw_intr+0x3a0>
     606:       89 f6           movl   %esi,%esi
            if (sgpnt[i].alt_address) {
     608:       8d 04 5b        leal   (%ebx,%ebx,2),%eax
     60b:       8d 14 85 00 00  leal   0x0(,%eax,4),%edx
     610:       00 00 
     612:       83 7c 15 04 00  cmpl   $0x0,0x4(%ebp,%edx,1)
     617:       74 12           je     62b <rw_intr+0x393>
                scsi_free(sgpnt[i].address, sgpnt[i].length);
     619:       8b 44 15 08     movl   0x8(%ebp,%edx,1),%eax
     61d:       50              pushl  %eax
     61e:       8b 44 15 00     movl   0x0(%ebp,%edx,1),%eax
     622:       50              pushl  %eax
     623:       e8 fc ff ff ff  call   624 <rw_intr+0x38c>
            }
     628:       83 c4 08        addl   $0x8,%esp
     62b:       43              incl   %ebx
     62c:       8b 4c 24 3c     movl   0x3c(%esp,1),%ecx
     630:       0f b7 41 42     movzwl 0x42(%ecx),%eax
     634:       39 c3           cmpl   %eax,%ebx
     636:       7c d0           jl     608 <rw_intr+0x370>
        }

The kernel crashes in
            if (sgpnt[i].alt_address) {
more precisely in
     612:       83 7c 15 04 00  cmpl   $0x0,0x4(%ebp,%edx,1)

This code can be found in drivers/scsi/sr.c, line 273, function rw_intr.

The addresses in the register dump can be mapped to offsets in the objdump
listing by subtracting 04812004, because the kernel prints
Calling done function - at address 0481229c
and that is the function
00000298 <rw_intr>:
static void rw_intr (Scsi_Cmnd * SCpnt)
{
     298:       83 ec 28        subl   $0x28,%esp
The difference between 0481229c and 298 is 04812004, so
EIP:    0010:[<04812616>] points to the instruction at 612: in the listing.


I think the problem is in drivers/scsi/sr.c, line 198, function rw_intr:
        if (good_sectors > 0)
    { /* Some sectors were read successfully. */
The variable good_sectors is 2, so the kernel executes this part of the code,
where it calls
                    scsi_free(sgpnt[i].address, sgpnt[i].length);
                    scsi_free(SCpnt->buffer, SCpnt->sglist_len);  /* Free list of scatter-gather pointers */
and
                good_sectors -= 2;

After that, good_sectors is zero, because line 245
                printk("(%x %x %x) ",SCpnt->request.bh, SCpnt->request.nr_sectors,
                       good_sectors);
prints (2281998 36 0) to the console.

Then the code after line 264
    if (good_sectors == 0) {
        /* We only come through here if no sectors were read successfully. */
is also executed, but the scatter-gather list read in line 273
            if (sgpnt[i].alt_address) {
has already been freed, and so the kernel crashes.


I have already tried changing if (good_sectors == 0) to an else,
but then the kernel crashes in sr.c at line 351
        printk("SCSI CD error : host %d id %d lun %d return code = %03x\n",
               scsi_CDs[DEVICE_NR(SCpnt->request.rq_dev)].device->host->host_no,
               scsi_CDs[DEVICE_NR(SCpnt->request.rq_dev)].device->id,
               scsi_CDs[DEVICE_NR(SCpnt->request.rq_dev)].device->lun,
               result);
with "general protection: 0000" when accessing host_no.


AMD K6 200 MHz processor
Asus TX97 Mainboard
Adaptec AHA-2940UW scsi controller
Pioneer Super12x scsi cdrom

(from /proc/pci):
  Bus  0, device  10, function  0:
    SCSI storage controller: Adaptec AIC-7881U (rev 0).
      Medium devsel.  Fast back-to-back capable.  IRQ 15.  Master Capable.  Latency=32.  Min Gnt=8.Max Lat=8.
      I/O at 0xd400.
      Non-prefetchable 32 bit memory at 0xe4000000.

(from /proc/scsi/scsi):
Host: scsi0 Channel: 00 Id: 05 Lun: 00
  Vendor: PIONEER  Model: CD-ROM DR-U12X   Rev: 1.06
  Type:   CD-ROM                           ANSI SCSI revision: 02


Please send me an e-mail if you need additional information.


                        Alexander

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to [EMAIL PROTECTED]