[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-15 Thread Matt Sinclair via gem5-users
Hi David,

The dynamic register allocation policy allows the GPU to schedule as many 
wavefronts on a CU as that CU's register space can hold.  By default, the 
original register allocator released with this GPU model ("simple") only allowed 
1 wavefront per CU at a time, because the publicly available dependence modeling 
was fairly primitive.  However, this was not very realistic relative to how a 
real GPU performs, so my group added better dependence tracking support (more 
could probably still be done, but it reduced stalls by up to 42% relative to 
simple) and a register allocation scheme that allows multiple wavefronts to run 
concurrently per CU ("dynamic").

By default, the GPU model assumes that the simple policy is used unless 
otherwise specified.  I have a patch in progress to change that though: 
https://gem5-review.googlesource.com/c/public/gem5/+/57537.

Regardless, if applications are failing with the simple register allocation 
scheme, I wouldn't expect a more complex scheme to fix the issue.  But I do 
strongly recommend you use the dynamic policy for all experiments - otherwise 
you are using a very simple, less realistic GPU model.
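
For example, using the same apu_se.py invocation you already run inside docker, 
it is just one extra flag (the placeholders in angle brackets stand in for the 
paths and arguments from your existing command):

  gem5/build/GCN3_X86/gem5.opt gem5/configs/example/apu_se.py \
      --reg-alloc-policy=dynamic \
      --num-compute-units 4 -n3 \
      --benchmark-root=<your DNNMark benchmark dir> \
      -c <your benchmark binary> \
      --options="<your -config / -mmap options>"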

Setting all of that aside, I looked up the perror message you sent last night, 
and it appears that error occurs when your physical machine has run out of 
memory (which means there is not much we can do to fix it in gem5, since the 
machine itself would not allocate as much memory as you requested).  So, if you 
want to run LRN and cannot run on a machine with more memory, one thing you 
could do is change the LRN config file to use smaller NCHW values (e.g., reduce 
the batch size, N, from 100 to something smaller that fits on your machine): 
https://gem5.googlesource.com/public/gem5-resources/+/refs/heads/develop/src/gpu/DNNMark/config_example/lrn_config.dnnmark#6.
If you do this, though, you will likely need to re-run the generate_cachefile 
step so the MIOpen binaries are regenerated for the differently sized LRN.
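
As a rough sketch (I am assuming the batch-size entry in the linked config is 
the "n=100" line at line 6, and that the regeneration script is the 
generate_cachefiles.sh that ships in the DNNMark directory of gem5-resources - 
check the DNNMark README there for its exact arguments):

  # Shrink the LRN batch size in the config, e.g. N=100 -> N=10.
  sed -i 's/^n=100$/n=10/' \
      gem5/gem5-resources/src/gpu/DNNMark/config_example/lrn_config.dnnmark

  # Then regenerate the MIOpen cachefiles inside the same docker image you
  # already use for the runs.
  docker run --rm -v ${PWD}:${PWD} \
      -w ${PWD}/gem5/gem5-resources/src/gpu/DNNMark \
      gcr.io/gem5-test/gcn-gpu:v21-2 ./generate_cachefiles.sh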

Hope this helps,
Matt


[gem5-users] Re: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

2022-03-15 Thread David Fong via gem5-users
Hi Matt S.,

Thanks for the detailed reply.

I looked at the link you sent me for the weekly run.

I see an additional parameter which I didn't use:

--reg-alloc-policy=dynamic

What does this do?

I was able to run the two other tests you use in your weekly runs, 
test_fwd_pool and test_bwd_bn, with CUs=4.

David


From: Matt Sinclair 
Sent: Monday, March 14, 2022 7:41 PM
To: gem5 users mailing list 
Cc: David Fong; Kyle Roarty; Poremba, Matthew
Subject: RE: gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

Hi David,

I have not seen this mmap error before, and my initial guess was that it is 
happening because you are trying to allocate more memory than we created when 
mmap'ing the inputs for the applications (we do this to speed up SE mode, 
because otherwise initializing the arrays can take several hours).  However, the 
fact that it is failing in physical.cc and not in the application itself is 
throwing me off.  Looking at where the failure occurs, it seems the backing 
store code itself is failing (from such a large allocation).  Since the failure 
is in a C++ mmap call itself, that is perhaps more problematic - is "Cannot 
allocate memory" the output of the perror() call on the line above the fatal() 
print?
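
As a quick sanity check on the host side (this is purely a Linux-level check, 
nothing gem5 does for you): --mem-size 1536GB means gem5 asks the host for 
1536 * 2^30 = 1,649,267,441,664 bytes of backing store, which is exactly the 
size in the failed mmap in your log below.  You can see how much memory the host 
will actually let you commit with:

  # Total RAM, the commit limit, and how much is already committed
  grep -E 'MemTotal|CommitLimit|Committed_AS' /proc/meminfo
  # Overcommit policy: 0 = heuristic, 1 = always allow, 2 = strict
  cat /proc/sys/vm/overcommit_memory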

Regarding the other question, and the failures more generally: we have never 
tested with > 64 CUs before, so certainly you are stressing the system and 
encountering different kinds of failures than we have seen previously.

In terms of applications, I had thought most/all of them passed previously, but 
we do not test each and every one all the time because that would make our 
weekly regressions run for a very long time.  You can see which ones we run on a 
weekly basis here: 
https://gem5.googlesource.com/public/gem5/+/refs/heads/develop/tests/weekly.sh#176
I expect all of those to pass (although your comment seems to indicate that is 
not always true?).  Your issues are exposing that perhaps we need to test more 
of them beyond these 3 - though perhaps on a quarterly basis or so, to avoid 
inflating the weekly runtime.  Having said that, I have not run LRN in a long 
time, as some ML people told me that LRN was not widely used anymore.  But when 
I did run it, I do remember it requiring a large amount of memory - which 
squares with what you are seeing here.  I thought LRN needed --mem-size=32GB to 
run, but based on your message it seems that is not the case.

@Matt P: have you tried LRN lately?  If so, have you run into the same 
OOM/backing store failures?

I know Kyle R. is looking into your other failure, so this one may have to wait 
behind it from our end, unless Matt P knows of a fix.

Thanks,
Matt

From: David Fong via gem5-users <gem5-users@gem5.org>
Sent: Monday, March 14, 2022 4:38 PM
To: David Fong via gem5-users <gem5-users@gem5.org>
Cc: David Fong <da...@chronostech.com>
Subject: [gem5-users] gem5 : X86 + GCN3 (gfx801) + test_fwd_lrn

Hi,

I'm getting a memory-related error for test_fwd_lrn.
When I increased the memory size from 4GB to 512GB, I still got an "out of 
memory" error:

build/GCN3_X86/gpu-compute/gpu_compute_driver.cc:599: warn: unimplemented 
ioctl: AMDKFD_IOC_SET_SCRATCH_BACKING_VA
build/GCN3_X86/gpu-compute/gpu_compute_driver.cc:609: warn: unimplemented 
ioctl: AMDKFD_IOC_SET_TRAP_HANDLER
build/GCN3_X86/sim/mem_pool.cc:120: fatal: fatal condition freePages() <= 0 
occurred: Out of memory, please increase size of physical memory.

But once I increased the mem size to 1024GB, 1536GB, or 2048GB, I got this DRAM 
device capacity issue instead:

docker run --rm -v ${PWD}:${PWD} -v 
${PWD}/gem5/gem5-resources/src/gpu/DNNMark/cachefiles:/root/.cache/miopen/2.9.0 
-w ${PWD} gcr.io/gem5-test/gcn-gpu:v21-2 gem5/build/GCN3_X86/gem5.opt 
gem5/configs/example/apu_se.py --mem-size 1536GB --num-compute-units 256 -n3 
--benchmark-root=gem5/gem5-resources/src/gpu/DNNMark/build/benchmarks/test_fwd_lrn
 -cdnnmark_test_fwd_lrn --options="-config 
gem5/gem5-resources/src/gpu/DNNMark/config_example/lrn_config.dnnmark -mmap 
gem5/gem5-resources/src/gpu/DNNMark/mmap.bin" |& tee 
gem5_gpu_cu256_run_dnnmark_test_fwd_lrn_50latency.log
Global frequency set at 1 ticks per second
build/GCN3_X86/mem/mem_interface.cc:791: warn: DRAM device capacity (8192 
Mbytes) does not match the address range assigned (2097152 Mbytes)
mmap: Cannot allocate memory
build/GCN3_X86/mem/physical.cc:231: fatal: Could not mmap 1649267441664 bytes 
for range [0:0x180]!


Smaller numbers of CUs, such as 4, also hit the same type of error.

Is there a regression script or regression log for DNNMark that shows the 
mem-size or configurations that are known to work?

[gem5-users] Re: Modelling cache flushing on gem5 (RISC-V)

2022-03-15 Thread Ethan Bannister via gem5-users
Dear Eliot,

This is all invaluable so thank you for taking the time to message.

This message is just my current thinking, so please let me know if I’ve 
misinterpreted anything.

From what I can now tell, the best way to go is to add a request flag to 
mem/request.hh, and then issue the request with writeMemTiming from 
memhelpers.hh.  Then, as you have done, it should be possible to extend the 
caches to respond to this request (but in the case of fence.t, up to the point 
of unification rather than coherence?  It seems you can just add the DST_POU 
flag to the request to achieve this).  You could make each cache visit every 
block with some added delay depending on your exact modelling.  I've seen such 
a thing implemented by functional accesses in BaseCache::memWriteback and 
BaseCache::memInvalidate, but I am assuming your engine probably does this via 
timing writebacks on each block.  From what I can see, Cache::writebackBlk 
seems to be timing, and any latency from determining dirty lines (depending on 
our particular model) could be added to the cycle count.

As for the writeback buffer issue, it seems that, given any placement of 
fence.t, it should be conceptually valid to say that no channel exists across 
it.  Therefore you'd need to ensure the writeback buffer was emptied regardless. 
Is a memory fence able to achieve this, or does it require extending the caches 
further?  Then, I guess you would need some concept of worst-case execution time 
(as you have said, a fixed maximum), as otherwise fence.t in and of itself would 
become a communication channel.

I imagine a basic first implementation could do this functionally, to verify 
that everything that should be flushed is, and then be made more accurate 
afterwards.

At this point I've got the instruction decoding working and can flush an 
individual L1 block, so with respect to the caches I just need to extend the 
protocol appropriately.  I would appreciate a high-level but slightly more 
detailed explanation of the changes you made (particularly the engine) and the 
functions you called to get your implementation working whilst also keeping it 
timing accurate, assuming that is easier to provide than producing a 
potentially quite complicated patch.

Thanks again for your support,

Ethan

From: Eliot Moss 
Sent: 14 March 2022 14:15
To: Ethan Bannister; gem5-users@gem5.org
Subject: Re: [gem5-users] Modelling cache flushing on gem5 (RISC-V)

I just skimmed that paper (not surprised to see Gernot Heiser's name there!)
and I think that, while it would be a little bit of work, it might not be
*too* hard to implement something like fence.t for the caches.  It would be
substantially different from wbinvd.  The latter speaks to the whole cache
system, and I implemented it by a request that flows all the way up to the Point
of Coherence (memory bus) and back down as a new kind of snoop to all the
caches that talk through one or more levels to memory.  Then each cache
essentially has a little engine for writing dirty lines back.  It's that part
that would be useful here - I guess we'd be looking at a variation on it,
triggered in a slightly different way (not by a snoop, but by a different kind
of request).  To get sensible timings you'd need to decide what hardware
mechanisms are available for finding dirty lines.  I assumed they were indexed 
in some way such that finding a set with one or more dirty lines had no 
substantial overhead.  The L1 cache is small enough that we might get by with 
that assumption.  Alternatively, assuming each set provides an "at least one 
dirty line" bit, and that 64 of these set bits can be examined by a priority 
encoder to give you a set to work on - or to indicate that all 64 sets are 
clean - then a typical L1 cache would not need many cycles of reading those 
bits out to find the relevant sets.

For a 64 KB cache with 64 B lines and associativity 2, there are 512 sets, 
meaning we'd need to read 8 groups of 64 of these "dirty set" bits.  The actual 
writing back would usually take most of the time.
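
Spelling that arithmetic out (same numbers, just restated):

  64 KB / 64 B per line       = 1024 lines
  1024 lines / 2 ways per set =  512 sets
  512 sets / 64 bits per read =    8 reads of the "dirty set" bits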

Presumably you would need to wait until all the dirty lines make it to L2,
since if the writeback buffers are clogged there might still be a
communication channel there.  Still, by the time a context switch is complete,
those buffers may be guaranteed to have cleared - provided we can make an
argument that there is a fixed maximum amount of time needed for that to
happen.

Anyway, I hope this helps.

Eliot Moss
___
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org