[gem5-users] Re: SLICC: Main memory overwhelmed by requests?

2020-08-24 Thread tolausso--- via gem5-users
Thank you (once again) for your helpful answers, Jason!

After some more experimenting following your suggestions, I've found that
increasing the deadlock threshold (by several orders of magnitude) does not
make the problem go away, nor does increasing the number of memory channels
to 4 on its own. Nonetheless, I suspect your gut feeling that bandwidth
problems are to blame still holds, as increasing the DRAM size to 8192MB
makes the "deadlock" go away and is accompanied by the following message:

> warn: Physical memory size specified is 8192MB which is greater than 3GB.  
> *Twice the number of memory controllers would be created.*

More memory controllers => less congestion per controller – makes sense to me!
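
For my own future reference: the same effect can be had explicitly, without
growing the memory to 8GB, by splitting the physical range across several
interleaved DRAM controllers, which is roughly what configs/common/MemConfig.py
does under the hood. A sketch of that, assuming the gem5 20.0-era DRAMCtrl
subclasses (the class, sizes, and names below are illustrative, not my actual
config):

    import math
    from m5.objects import *   # AddrRange, DDR3_1600_8x8, ...

    # Sketch, in the style of configs/common/MemConfig.py: build several DRAM
    # controllers that interleave one physical range at cache-line (64B)
    # granularity.  Newer gem5 releases split this into MemCtrl + DRAMInterface.
    def make_interleaved_mem_ctrls(num_channels=4, mem_size='512MB'):
        intlv_bits = int(math.log2(num_channels))
        intlv_low_bit = 6                      # log2(64B granularity)
        ctrls = []
        for i in range(num_channels):
            ctrl = DDR3_1600_8x8()
            ctrl.range = AddrRange(start=0, size=mem_size,
                                   intlvHighBit=intlv_low_bit + intlv_bits - 1,
                                   intlvBits=intlv_bits,
                                   intlvMatch=i)
            ctrls.append(ctrl)
        # Each controller's port then gets connected as usual, e.g. one per
        # directory in a Ruby config: dir_cntrl.memory = ctrl.port
        return ctrls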

I wonder if the underlying issue has more to do with the cache hierarchy of my
system (e.g. no L2 cache, only small L1s) than with the protocol itself. Either
way, a band-aid solution is good enough for my current purposes :)

Thanks again for your help, Jason!

Best,
Theo

[gem5-users] Re: SLICC: Main memory overwhelmed by requests?

2020-08-21 Thread Jason Lowe-Power via gem5-users
Hi Theo,

It's possible that if you increase the deadlock timeout your protocol will
"just work". There's an infinite queue between the memory controller
(DRAMCtrl) and the Ruby directory (which sends the memory requests to the
memory controller). We've made some progress toward correctly modeling
backpressure there, but in some circumstances the queue size can grow very
large (i.e., when the request bandwidth far exceeds the memory's
bandwidth). The fact that your protocol works with more channels (i.e.,
more bandwidth) makes me suspect this is the problem.
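
If you do try raising the timeout from your config script, the knob the
deadlock panic is tied to is the sequencers' deadlock_threshold parameter.
Something along these lines should do it (a sketch; I'm assuming your script
keeps its RubySequencer objects in a list, the way the stock Ruby configs'
cpu_sequencers does):

    # Sketch: raise Ruby's deadlock/livelock detection threshold.
    def raise_deadlock_threshold(sequencers, cycles=50000000):
        for seq in sequencers:
            # A request that stays outstanding longer than this many cycles
            # triggers the "possible deadlock" panic.  Raising it buys time
            # for a long memory backlog to drain; it adds no bandwidth.
            seq.deadlock_threshold = cycles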

As far as debugging goes... memory requests are (usually) sent from a Ruby
directory back into gem5's "normal" memory system. They are sent via a
special message buffer, always called "requestToMemory". In
AbstractController::serviceMemoryQueue(), this message buffer is checked,
and if it is non-empty, a new packet is created and sent across the
"memory" port of the directory (or whatever state machine owns the buffer).
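
In a config script that plumbing usually looks something like the sketch below
(dir_cntrl and mem_ctrl are placeholders for your own objects, and
responseFromMemory is the usual counterpart buffer, so check your protocol's
config for the exact names):

    from m5.objects import MessageBuffer

    # Sketch: how a directory state machine is typically wired to memory in a
    # Ruby config script.
    def connect_dir_to_memory(dir_cntrl, mem_ctrl):
        dir_cntrl.requestToMemory = MessageBuffer()     # Ruby -> memory requests
        dir_cntrl.responseFromMemory = MessageBuffer()  # memory -> Ruby responses
        dir_cntrl.memory = mem_ctrl.port                # the "memory" port above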

On the DRAM side, you can use the "MemAccess" debug flag to see when the
memory is *functionally* accessed and "DRAM" for the DRAM transactions.
Finally, you might want to use the "PacketQueue" debug flag because there
is an (infinite) QueuedPort between the memory and your Ruby controllers.
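
The flags normally go on the command line; if you'd rather flip them on from
the config script, something like the snippet below may work (the
m5.debug.flags interface is an assumption on my part, so fall back to
--debug-flags if it doesn't match your tree):

    # Usual route (needs a gem5.opt or gem5.debug build):
    #   build/<ISA>/gem5.opt --debug-flags=MemAccess,DRAM,PacketQueue <config> ...
    #
    # Alternative from inside the config script; the m5.debug.flags interface
    # is assumed here, so double-check it against your gem5 checkout.
    import m5

    for flag in ("MemAccess", "DRAM", "PacketQueue"):
        m5.debug.flags[flag].enable()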

Hopefully this helps track down the problem. Let us know if you have more
questions! This is a complicated code path.

Cheers,
Jason


On Fri, Aug 21, 2020 at 8:56 AM tolausso--- via gem5-users <
gem5-users@gem5.org> wrote:

> Hi all,
>
> I am trying to run a Linux kernel in FS mode, with a custom-rolled
> SLICC/Ruby directory-based cache coherence protocol, but it seems like the
> memory controller is dropping some requests in rare circumstances --
> possibly due to it being overwhelmed with requests.
>
> The protocol seems to work fine for a long time but about 90% of the way
> into booting the kernel, around the same time as the "mounting
> filesystems..." message appears, gem5 crashes and reports a deadlock.
> Inspecting the trace, it seems that the deadlock occurs during a period of
> very high main memory traffic; the trace looks something like this:
> > Directory receives DMA read request for Address 1, sends MEMORY_READ to
> memory controller
> > Directory receives DMA read request for Address 2, sends MEMORY_READ to
> memory controller
> > ...
> >  Directory receives DMA read request for Address N, sends MEMORY_READ to
> memory controller
> > Directory receives CPU read request for Address A, sends MEMORY_READ to
> memory controller
>
> After some time, the Directory receives responses for all of the
> DMA-induced requests (Address 1...N). However, it never hears back about
> the MEMORY_READ to Address A, and so eventually gem5 calls it a day and
> reports a deadlock. Address A is distinct from addresses 1..N and its read
> should therefore not be affected by the requests to the other addresses.
>
> I have tried:
> * Using the same kernel with one of the example SLICC protocols
> (MOESI_CMP_directory). No error occurred, so the underlying issue must be
> with my protocol.
> * Upping the memory size to 8192MB (from 512MB) and increasing the number
> of channels to 4 (from 1). Under this configuration the above issue does
> not occur, and the Linux kernel happily finishes booting. This, combined
> with the fact that it takes so long for any issues to occur, makes me think
> that my protocol is somehow overwhelming the memory controller, causing it
> to drop the request to read Address A. In other words, I am pretty
> confident that the error is not something as simple as, for example,
> forgetting to pop the memory queue.
>
> If anyone has any clues as to what might be going on I would very much
> appreciate your comments.
> I was especially wondering about the following:
> * Is it even possible for requests to main memory to fail due to, for
> example, network congestion? If so, is there any way to catch this and retry
> the request?
> * (Noob question): Where in gem5 do the main memory requests "go to"? Is
> there a debugging flag I could use to check whether the main memory
> receives the request?
>
> Best,
> Theo Olausson
> Univ. of Edinburgh