(Resending, as my email was bounced from the gem5 mailing list.)

Hi Brad & Tony,

Gaurav (CC'd) and I have been attempting to run the Pannotia benchmarks on the amd-gcn3-staging branch. We've hipified all of them and tested that they all work on "real" GPUs, then removed the copies before running them in the simulator. However, several of the benchmarks (FloydWarshall, Color) are getting Invalid DMA Transition deadlocks in the MOESI_AMD_Base-dir protocol. We've gotten a trace, done a bunch of digging, and identified the following pattern as causing the deadlock:

*tl;dr: There is a deadlock in the MOESI_AMD_Base-dir protocol that relates to a race between a CPU load and a DMA load. Is there documentation of how the DMA should behave? Shouldn't the directory always wake up any pending requests when the thing it was pending on completes?*

1. DMA issues a read for Block A.
2. Directory receives 1, initiates an Invalidation probe, and sends a read to memory because the data is currently not in any of the caches.
3. The CPU (CorePair) issues a read for Block A (before the invalidation from 2 is received), since it misses. The CPU transitions from I --> I_EOS.
4. The directory receives 3, but because it is in the middle of BDR_PM, it stalls the CPU's read request and puts it on the stall buffer -- https://gem5.googlesource.com/amd/gem5/+/refs/heads/agutierr/master-gcn3-staging/src/mem/protocol/MOESI_AMD_Base-dir.sm#1086 .
5. The CPU receives the Invalidation request and acknowledges it, but stays in I_EOS. This strikes me as strange: wouldn't we want to go to I, or at least to another transient state that will transition to I when everything is done? But this is not the key issue here ...
6. Directory receives the memory response and transitions from BDR_PM --> BDR_Pm.
7. (GPU/other CPU receive the invalidation and respond appropriately.)
8. The remaining invalidation responses are received by the Directory, and it transitions from BDR_Pm --> U.
   - This strikes me as problematic, and I believe it is the source of the deadlock.

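To make the sequence above concrete, here is a toy Python model of a single directory entry. It is only a sketch, not SLICC and not the actual protocol code: the `ToyDirectory` class, the `wake_on_completion` toggle, and the assumption of exactly two probe targets are made up for illustration, and only the event ordering from the steps above is modelled. The state names are borrowed from the protocol, with B_PM assumed to be the state entered for a CPU read.

```python
from collections import deque


class ToyDirectory:
    """Toy model of one directory entry; NOT the real SLICC state machine."""

    def __init__(self, wake_on_completion, probe_targets=2):
        self.state = "U"
        self.acks_left = 0
        self.probe_targets = probe_targets  # assumed number of probed caches
        self.stall_buffer = deque()         # requests parked while the block is busy
        self.wake_on_completion = wake_on_completion  # hypothetical toggle

    def request(self, requester):
        if self.state != "U":
            # Step 4: the block is busy, so park the request.
            self.stall_buffer.append(requester)
            return
        # Step 2: start probes (and, implicitly, the memory read).
        # Assumption: a DMA read enters BDR_PM, a CPU read enters B_PM.
        self.state = "BDR_PM" if requester == "DMA" else "B_PM"
        self.acks_left = self.probe_targets

    def mem_data(self):
        # Step 6: memory responds while probes are still outstanding.
        assert self.state == "BDR_PM"
        self.state = "BDR_Pm"

    def probe_ack(self):
        # Steps 5, 7, 8: one probe response arrives.
        self.acks_left -= 1
        if self.acks_left == 0 and self.state == "BDR_Pm":
            self.state = "U"
            if self.wake_on_completion:
                # Replay parked requests now that the block is idle again.
                while self.stall_buffer and self.state == "U":
                    self.request(self.stall_buffer.popleft())


def run(wake_on_completion):
    d = ToyDirectory(wake_on_completion)
    d.request("DMA")  # steps 1-2
    d.request("CPU")  # steps 3-4: stalled behind the DMA read
    d.probe_ack()     # step 5: the CPU acknowledges the probe
    d.mem_data()      # step 6: BDR_PM --> BDR_Pm
    d.probe_ack()     # steps 7-8: last response, BDR_Pm --> U
    return d.state, list(d.stall_buffer)


print(run(False))  # ('U', ['CPU'])
print(run(True))   # ('B_PM', [])
```

With `wake_on_completion=False`, the directory is back in U but the CPU read is still parked and nothing will ever replay it, which matches the hang we see. With the toggle set, the parked read is replayed as soon as the DMA read completes.
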
Shouldn't we wake up the things on the stall buffer now, since the request before it just completed? In any event, that is not what happens right now ( https://gem5.googlesource.com/amd/gem5/+/refs/heads/agutierr/master-gcn3-staging/src/mem/protocol/MOESI_AMD_Base-dir.sm#1332 ), which leads to a deadlock. There are other cases (e.g., when 1 and 3 are inverted) where the stall buffer will be woken up (I believe it's this one: https://gem5.googlesource.com/amd/gem5/+/refs/heads/agutierr/master-gcn3-staging/src/mem/protocol/MOESI_AMD_Base-dir.sm#1169 , but Gaurav can correct me if I grabbed the wrong one).

We tried making the "obvious" changes:

- On the BDR_Pm --> U transition, call . We added it before deallocating the TBE in this transition ( https://gem5.googlesource.com/amd/gem5/+/refs/heads/agutierr/master-gcn3-staging/src/mem/protocol/MOESI_AMD_Base-dir.sm#1334 ), because that's what the transition linked above did (lines 1169-1172).
- This led to an Invalid Transition for the next DMA request on the same address, because the B_PM state doesn't have a transition for when a DMA request arrives. So we added that in, putting the DMA request in the stall buffer just like the other case did for the CPU read. But then we get an assert failure in the message buffer -- isReady() in stallMessage (which gets called from st_stallAndWaitRequest) fails because the message is enqueued later.

When we got to this point, I wasn't sure what the next step should be. So, we were wondering:

- Is this a bug you are aware of internally?
- Do you know why the stall buffer would not be woken up in step 8 above?
- Do you have any publicly available documentation about the DMA requests, and/or how this situation should be handled?
- Do you see anything wrong with the logic of the changes above, since making them clearly did not immediately solve the problem?
- I did notice that the mainline MOESI_AMD_Base-dir.sm file does not have these DMA transitions, but I'm guessing the intent is for your branch to eventually take that file's place, and not vice versa, where mainline has a fix you don't have on your staging branch?

Any help you could provide would be greatly appreciated!

Regards,
Matt Sinclair
Assistant Professor
University of Wisconsin-Madison
Computer Sciences Department
cs.wisc.edu/~sinclair