Re: [m5-dev] Ruby FS - DMA Controller problem?

Beckmann, Brad Mon, 14 Mar 2011 14:33:06 -0700

Hi Malek,

Just to reiterate, I don't think my patches will fix the underlying problem.  
Instead, my patches just fix various corner cases in the protocols.  I suspect 
these corner cases are never actually reached in real execution.


The fact that your dma traces point out that the Ruby and Classic 
configurations use different base addresses makes me think this might be a 
problem with configuration and device registration.  We should investigate 
further.
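
A minimal sketch of how the base addresses could be pulled out of the two IdeDisk traces and compared side by side. It assumes the PRD line layout seen in the traces quoted below; the names `PRD_RE`, `prd_entries`, and `compare_prds` are hypothetical helpers, not part of M5, and the value printed in parentheses is carried along without being interpreted.

```python
import re

# Matches IdeDisk PRD lines as they appear in the quoted traces, e.g.
#   2398597916500: system.disk0: PRD: baseAddr:0x87298000 (0x7298000) byteCount:8192 ...
# \s+ before byteCount also tolerates a line wrap at that position.
PRD_RE = re.compile(
    r"^(?P<tick>\d+): (?P<dev>\S+): PRD: baseAddr:(?P<base>0x[0-9a-fA-F]+) "
    r"\((?P<paren>0x[0-9a-fA-F]+)\)\s+byteCount:(?P<count>\d+)",
    re.MULTILINE,
)


def prd_entries(trace_text):
    """Return (device, baseAddr, parenthesized value, byteCount) per PRD line."""
    return [
        (m["dev"], int(m["base"], 16), int(m["paren"], 16), int(m["count"]))
        for m in PRD_RE.finditer(trace_text)
    ]


def compare_prds(ruby_text, classic_text):
    """Pair PRD entries in order and return those whose addresses differ."""
    mismatches = []
    pairs = zip(prd_entries(ruby_text), prd_entries(classic_text))
    for idx, (ruby, classic) in enumerate(pairs):
        # Compare device and both address fields; ticks are ignored on purpose.
        if ruby[:3] != classic[:3]:
            mismatches.append((idx, ruby, classic))
    return mismatches
```

Running `compare_prds` over the full Ruby and Classic trace files would list every PRD entry where the two configurations disagree, which may help show whether the mismatch is a constant offset or varies per transfer.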

Brad


> -----Original Message-----
> From: Malek Musleh [mailto:malek.mus...@gmail.com]
> Sent: Monday, March 14, 2011 9:11 AM
> To: M5 Developer List
> Cc: Beckmann, Brad
> Subject: Re: [m5-dev] Ruby FS - DMA Controller problem?
> 
> Hi Korey/Brad,
> 
> I commented out the following lines:
> 
> In RubyPort.hh
> 
>  unsigned deviceBlockSize() const;
> 
> In RubyPort.cc
> 
> unsigned
> RubyPort::M5Port::deviceBlockSize() const
> {
>     return (unsigned) RubySystem::getBlockSizeBytes();
> }
> 
> I also did a diff trace between M5 and Ruby using the IdeDisk traceflag as
> indicated earlier on.
> 
> In the Ruby Trace, it stalls at this
> 
> 2398589225000: system.disk0: Write to disk at offset: 0x1 data 0
> 2398589400000: system.disk0: Write to disk at offset: 0x2 data 0x10
> 2398589575000: system.disk0: Write to disk at offset: 0x3 data 0
> 2398589742000: system.disk0: Write to disk at offset: 0x4 data 0
> 2398589909000: system.disk0: Write to disk at offset: 0x5 data 0
> 2398590088000: system.disk0: Write to disk at offset: 0x6 data 0xe0
> 2398596763500: system.disk0: Write to disk at offset: 0x7 data 0xc8
> 2398597916500: system.disk0: PRD: baseAddr:0x87298000 (0x7298000) byteCount:8192 (16) eot:0x8000 sector:0
> 2398597916500: system.disk0: doDmaWrite, diskDelay: 1000000 totalDiskDelay: 1000016
> 
> Waiting for the Interrupt to be Posted.
> 
> However, a comparison between the M5 and Ruby traces suggests that they
> differ on the following line:
> 
> RubyTrace:
> 
> 2398589400000: system.disk0: Write to disk at offset: 0x2 data 0x10
> 2398589575000: system.disk0: Write to disk at offset: 0x3 data 0
> 2398589742000: system.disk0: Write to disk at offset: 0x4 data 0
> 2398589909000: system.disk0: Write to disk at offset: 0x5 data 0
> 2398590088000: system.disk0: Write to disk at offset: 0x6 data 0xe0
> 2398596763500: system.disk0: Write to disk at offset: 0x7 data 0xc8
> 2398597916500: system.disk0: PRD: baseAddr:0x87298000 (0x7298000) byteCount:8192 (16) eot:0x8000 sector:0
> 2398597916500: system.disk0: doDmaWrite, diskDelay: 1000000 totalDiskDelay: 1000016
> 
> 
> M5 Trace:
> 
> 2237623634000: system.disk0: Write to disk at offset: 0x7 data 0xc8
> 2237624206501: system.disk0: PRD: baseAddr:0x87392000 (0x7392000) byteCount:8192 (16) eot:0x8000 sector:0
> 2237624206501: system.disk0: doDmaWrite, diskDelay: 1000000 totalDiskDelay: 1000016
> 
> Note that the PRD baseAddr it tries to access is different, which I would
> think should be the same, right? Is there any reason why it should be
> different? The 0 or 1 block size and the sequential retries are forcing the
> DMA timer to time out the request, which then fails in the inconsistent DMA
> state.
> 
> I have attached both sets of traces in case they shed any more light on the
> cause of the problem.
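
Because the tick counts differ between the two runs, a plain diff flags every line. A rough sketch of a comparison that ignores the leading tick and reports the first message that differs, assuming each trace line starts with `<tick>: ` as in the excerpts above (`strip_tick` and `first_divergence` are hypothetical helper names):

```python
import re
from itertools import zip_longest

# Leading "<tick>: " prefix of a trace line.
TICK_RE = re.compile(r"^\d+: ")


def strip_tick(line):
    """Drop the leading tick so that only the device message is compared."""
    return TICK_RE.sub("", line.rstrip("\n"))


def first_divergence(trace_a, trace_b):
    """Return (index, message_a, message_b) for the first differing message.

    Returns None when both traces carry the same messages in the same order.
    A trace that ends early is reported with None on its side.
    """
    for idx, (a, b) in enumerate(zip_longest(trace_a, trace_b)):
        msg_a = None if a is None else strip_tick(a)
        msg_b = None if b is None else strip_tick(b)
        if msg_a != msg_b:
            return idx, msg_a, msg_b
    return None
```

The two arguments can be lists of lines or open file objects. The traces would need to be aligned first (for example, both trimmed to start at the same disk command), since this sketch compares them position by position.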
> 
> In any case, it might not matter too much now since Brad was able to
> reproduce the problem and has a patch for it, but it may be of use for
> future M5 changes.
> 
> Malek
> 
> On Mon, Mar 14, 2011 at 11:54 AM, Beckmann, Brad
> <brad.beckm...@amd.com> wrote:
> > Thanks Malek.  Very interesting.
> >
> > Yes, this 5 line changeset seems rather benign, but actually has huge
> > ramifications.  With this change, the RubyPort passes the correct block
> > size to the cpu/device models.  Without it, I believe the block size
> > defaults to 0 or 1...I can't remember which.  While that seems rather
> > inconsequential, I noticed when I made this change that the memtester
> > behaved quite differently.  In particular, it keeps issuing requests until
> > sendTiming returns false, instead of just one request/cpu at a time.
> > Therefore another patch in this series added the retry mechanism to the
> > RubyPort.  I'm still not sure exactly what the problem is with ruby+dma,
> > but I suspect that the dma devices are behaving differently now that the
> > RubyPort passes the correct block size.
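
The effect of the reported block size on request traffic can be illustrated with a toy model. This is only a sketch under assumptions, not M5 or Ruby code: `split_into_chunks`, `ToyPort`, `send_timing`, `complete_one`, and `issue_until_rejected` are hypothetical names, and the port simply refuses requests once a fixed number are outstanding.

```python
def split_into_chunks(addr, size, block_size):
    """Yield (address, length) pieces that never cross a block boundary."""
    if block_size <= 0:
        raise ValueError("block size must be positive")
    end = addr + size
    while addr < end:
        # Bytes left before the next block boundary, capped by what remains.
        length = min(block_size - (addr % block_size), end - addr)
        yield addr, length
        addr += length


class ToyPort:
    """Hypothetical port that accepts at most `capacity` outstanding requests."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.outstanding = 0

    def send_timing(self, request):
        if self.outstanding >= self.capacity:
            return False  # rejected: the sender has to wait for a retry
        self.outstanding += 1
        return True

    def complete_one(self):
        # One request finished; a real port would now signal a retry.
        self.outstanding -= 1


def issue_until_rejected(port, pending):
    """Send requests from the front of `pending` until the port refuses one."""
    sent = 0
    while pending and port.send_timing(pending[0]):
        pending.pop(0)
        sent += 1
    return sent
```

In this model, the 8192-byte transfer from the trace splits into 128 pieces with a 64-byte block size and into 8192 pieces with a block size of 1, so the block size a device is told about changes how often it runs into a rejected send and depends on the retry path.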
> >
> > I was able to spend a few hours on this over the weekend.  I am now able
> > to reproduce the error and I have a few protocol bug fixes queued up.
> > However, I don't think those fixes actually solved the main issue.  I
> > don't think I'll be able to get to it today, but I'll try to find some
> > time tomorrow to investigate further.
> >
> > Brad
> >
> >
> >> -----Original Message-----
> >> From: m5-dev-boun...@m5sim.org [mailto:m5-dev-boun...@m5sim.org] On Behalf Of Korey Sewell
> >> Sent: Monday, March 14, 2011 2:10 AM
> >> To: M5 Developer List
> >> Subject: Re: [m5-dev] Ruby FS - DMA Controller problem?
> >>
> >> Which lines are you commenting out to get it to work? It's a bit
> >> unclear in the diff you point to (maybe because you said it's a full
> >> set of changes, not just one)
> >>
> >> (btw: The work I've been doing is comparing the "old m5" memory trace
> >> to the "gem5" memory trace to try to chase down the bug. I wouldn't
> >> be surprised if we are converging to the same bug though.)
> >>
> >> On Mon, Mar 14, 2011 at 3:51 AM, Malek Musleh
> >> <malek.mus...@gmail.com> wrote:
> >> > Hi Brad,
> >> >
> >> > I found the problem that was causing this error. Specifically, it
> >> > is this changeset:
> >> >
> >> > changeset:   7909:eee578ed2130
> >> > user:        Joel Hestness <hestn...@cs.utexas.edu>
> >> > date:        Sun Feb 06 22:14:18 2011 -0800
> >> > summary:     Ruby: Fix to return cache block size to CPU for split
> >> > data transfers
> >> >
> >> > Link: http://reviews.m5sim.org/r/393/diff/#index_header
> >> >
> >> > Previously, I mentioned it was a couple of changesets prior to this
> >> > one, but the changes between them are related, so it wasn't as
> >> > obvious what was happening.
> >> >
> >> > In fact, this corresponds to the assert() for the block size you
> >> > had put in to deal with x86 unaligned accesses, but then later
> >> > removed because of LL/SC in Alpha.
> >> >
> >> > It's not clear to me why this is causing a problem, or rather why
> >> > this doesn't return the default 64 byte block size from the ruby
> >> > system, but commenting out those lines of code allowed it to work.
> >> >
> >> > Maybe Korey could confirm?
> >> >
> >> > Malek
> >> >
> >> > On Wed, Mar 9, 2011 at 8:24 PM, Beckmann, Brad <brad.beckm...@amd.com> wrote:
> >> >> I still have not been able to reproduce the problem, but I haven't
> >> >> tried in a few weeks.  So does this happen when booting up the system,
> >> >> independent of what benchmark you are running?  If so, could you send
> >> >> me your command line?  I'm sure the disk image and kernel binaries
> >> >> between us are different, so I don't necessarily think I'll be able
> >> >> to reproduce your problem, but at least I'll be able to isolate it.
> >> >>
> >> >> Brad
> >> >>
> >> >>
> >> >>
> >> >>> -----Original Message-----
> >> >>> From: m5-dev-boun...@m5sim.org [mailto:m5-dev-boun...@m5sim.org] On Behalf Of Malek Musleh
> >> >>> Sent: Wednesday, March 09, 2011 4:41 PM
> >> >>> To: M5 Developer List
> >> >>> Subject: Re: [m5-dev] Ruby FS - DMA Controller problem?
> >> >>>
> >> >>> Hi Korey,
> >> >>>
> >> >>> I ran into a similar problem with a different benchmark/boot up
> >> >>> attempt.
> >> >>> There is another thread on m5-dev with 'Ruby FS failing with
> >> >>> recent changesets' as the subject. I was able to track down the
> >> >>> changeset which it was coming from, but I did not look further
> >> >>> into the changeset as to why it was causing it.
> >> >>>
> >> >>> Brad said he would take a look at it, but I am not sure if he was
> >> >>> able to reproduce the problem.
> >> >>>
> >> >>> Malek
> >> >>>
> >> >>> On Wed, Mar 9, 2011 at 7:08 PM, Korey Sewell <ksew...@umich.edu> wrote:
> >> >>> > Hi all,
> >> >>> > I'm trying to run Ruby in FS mode for the FFT benchmark.
> >> >>> >
> >> >>> > However, I've been unable to fully boot the kernel; it fails
> >> >>> > with a panic in the IDE disk controller:
> >> >>> > panic: Inconsistent DMA transfer state: dmaState = 2 devState = 1
> >> >>> > @ cycle 62640732569001
> >> >>> > [doDmaTransfer:build/ALPHA_FS_MOESI_CMP_directory/dev/ide_disk.cc, line 323]
> >> >>> >
> >> >>> > Has anybody run into a similar error or does anyone have any
> >> >>> > suggestions for debugging the problem? I can run the same code
> >> >>> > using the M5 memory system and FFT finishes properly, so it's
> >> >>> > definitely a Ruby-specific thing. To track this down, it seems
> >> >>> > I could diff instruction traces (M5 vs. Ruby) or maybe even diff
> >> >>> > trace output from the IdeDisk trace flags, but those routes seem
> >> >>> > a bit heavy-handed considering the amount of trace output generated.
> >> >>> >
> >> >>> > The command line this was run with is:
> >> >>> > build/ALPHA_FS_MOESI_CMP_directory/m5.opt configs/example/ruby_fs.py -b fft_64t_base -n 1
> >> >>> >
> >> >>> > The output in system.terminal is:
> >> >>> > hda: M5 IDE Disk, ATA DISK drive
> >> >>> > hdb: M5 IDE Disk, ATA DISK drive
> >> >>> > hda: UDMA/33 mode selected
> >> >>> > hdb: UDMA/33 mode selected
> >> >>> > hdc: M5 IDE Disk, ATA DISK drive
> >> >>> > hdc: UDMA/33 mode selected
> >> >>> > ide0 at 0x8410-0x8417,0x8422 on irq 31
> >> >>> > ide1 at 0x8418-0x841f,0x8426 on irq 31
> >> >>> > ide_generic: please use "probe_mask=0x3f" module parameter for
> >> >>> > probing all legacy ISA IDE ports
> >> >>> > ide2 at 0x1f0-0x1f7,0x3f6 on irq 14
> >> >>> > ide3 at 0x170-0x177,0x376 on irq 15
> >> >>> > hda: max request size: 128KiB
> >> >>> > hda: 2866752 sectors (1467 MB), CHS=2844/16/63
> >> >>> >  hda:<4>hda: dma_timer_expiry: dma status == 0x65
> >> >>> > hda: DMA interrupt recovery
> >> >>> > hda: lost interrupt
> >> >>> >  unknown partition table
> >> >>> > hdb: max request size: 128KiB
> >> >>> > hdb: 1008000 sectors (516 MB), CHS=1000/16/63
> >> >>> >  hdb:<4>hdb: dma_timer_expiry: dma status == 0x65
> >> >>> > hdb: DMA interrupt recovery
> >> >>> > hdb: lost interrupt
> >> >>> >
> >> >>> > Thanks again, any help or thoughts would be well appreciated.
> >> >>> >
> >> >>> > --
> >> >>> > - Korey
> >> >>> > _______________________________________________
> >> >>> > m5-dev mailing list
> >> >>> > m5-dev@m5sim.org
> >> >>> > http://m5sim.org/mailman/listinfo/m5-dev
> >> >>> >
> >> >>
> >> >>
> >> >>
> >> >
> >>
> >>
> >>
> >> --
> >> - Korey
> >
> >
> >

_______________________________________________
m5-dev mailing list
m5-dev@m5sim.org
http://m5sim.org/mailman/listinfo/m5-dev
