Hey Matt,

Happy to do that if you think it's viable, but I have to say my workaround
is pretty hacky. There are definitely some benchmark changes on top of
changes to the simulator.

Let me describe it for you and then if you still think it's a good idea
I'll make a patch.

My workaround relies on two facts:
1. Launching a kernel sets up a completion signal that
hipDeviceSynchronize() ultimately waits on.
2. All you need in the benchmark is that completion signal to know that
your kernel is complete.

So I basically implement the HSA_PACKET_TYPE_AGENT_DISPATCH packet type in
hsa_packet_processor.cc and gpu_command_processor.cc so that the simulator
can receive commands from the benchmark directly.
One of the commands steals the completion signal for a particular kernel
and passes it back to the benchmark.

After you launch the kernel (normally), you send a command to steal the
completion signal, passing in that kernel's id (which you have to keep
track of yourself). The signal gets passed back in the return_address
member of the agent dispatch packet.

In the benchmark I store that signal and do an hsa_signal_wait_relaxed on
it.
You have to do this every time you launch a kernel; you could conceivably
overload the hipDeviceSynchronize() function to keep track of kernel
launches and do this for you.
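
Here's a minimal sketch of the benchmark side, so you can judge the
hackiness for yourself. STEAL_SIGNAL_CMD and the arg layout are
hypothetical stand-ins for whatever you define to match the simulator
side, the include path may differ, and I've elided error checking plus
the handshake that confirms the agent dispatch packet itself has been
processed:

    #include <hsa/hsa.h>
    #include <cstdint>
    #include <cstring>

    // Hypothetical command code; it has to match whatever the patched
    // hsa_packet_processor.cc / gpu_command_processor.cc expects.
    static const uint16_t STEAL_SIGNAL_CMD = 0x8000;

    // Ask the simulator for kernel `kernel_id`'s completion signal,
    // then block until that kernel finishes.
    void wait_for_kernel(hsa_queue_t *queue, uint64_t kernel_id)
    {
        hsa_signal_t stolen = {0};

        // Reserve a slot in the AQL queue.
        uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
        hsa_agent_dispatch_packet_t *pkt =
            (hsa_agent_dispatch_packet_t *)queue->base_address
            + (index & (queue->size - 1));

        memset(pkt, 0, sizeof(*pkt));
        pkt->type = STEAL_SIGNAL_CMD;   // which command the CP should run
        pkt->arg[0] = kernel_id;        // which kernel's signal we want
        pkt->return_address = &stolen;  // simulator writes the signal here

        // Publish the packet: store the header last, with release
        // semantics, then ring the doorbell.
        uint16_t header = HSA_PACKET_TYPE_AGENT_DISPATCH << HSA_PACKET_HEADER_TYPE;
        __atomic_store_n(&pkt->header, header, __ATOMIC_RELEASE);
        hsa_signal_store_relaxed(queue->doorbell_signal, index);

        // ... wait for the agent dispatch itself to complete (e.g. on
        // pkt->completion_signal) before reading `stolen`, elided here ...

        // Kernel completion signals count down to 0 when the kernel is done.
        hsa_signal_wait_relaxed(stolen, HSA_SIGNAL_CONDITION_LT, 1,
                                UINT64_MAX, HSA_WAIT_STATE_ACTIVE);
    }

If you went the hipDeviceSynchronize()-overloading route, the overload
would just record each launch's id and loop over these waits.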

Let me know if you think this is still something you guys want.

Cheers,

Dan


On Mon, Jun 22, 2020 at 2:04 PM Matt Sinclair <mattdsincl...@gmail.com>
wrote:

> Hi Dan,
>
> Do you mind pushing your workaround with the completion signal as a patch
> to the staging branch so we can take a look?  Or is this just a change to
> the program(s) themselves?
>
> After Kyle's fix (which has been pushed as an update to my patch), we're
> still seeing some hipDeviceSynchronize failures.  So we're interested in
> looking at what you did to see if it solves the problem.
>
> Matt
>
> On Fri, Jun 19, 2020 at 4:15 PM Kyle Roarty <kroa...@wisc.edu> wrote:
>
>> Hi Dan,
>>
>> Another thing to try is to set the environment variable HIP_DB in
>> apu_se.py (line 461 for me; the line starts with "env = ['LD_LIBRARY_PATH...").
>> Setting HIP_DB=sync or HIP_DB=api has prevented crashing on
>> hipDeviceSynchronize() calls for the applications I've tested.
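>>
>> As a rough sketch of the change (the existing contents of the env list
>> here are made up; only the added HIP_DB entry matters):
>>
>>     # apu_se.py: environment handed to the simulated process
>>     env = ['LD_LIBRARY_PATH=%s' % os.environ.get('LD_LIBRARY_PATH', ''),
>>            'HIP_DB=sync']  # or 'HIP_DB=api'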
>>
>> I had traced through this issue (or at least one that manifests the same
>> way) a while back, and what I remember is that the crash happens somewhere
>> in the HIP code, and it occurs because somewhere much earlier we go down a
>> codepath that doesn't clear a register (I believe that was also in HIP
>> code). That register then gets re-used until the error propagates to the
>> register used in the ld instruction. Unfortunately, I had a hard time
>> getting consistent, manageable traces, so I wasn't able to figure out why
>> we were going down the wrong codepath.
>>
>> Kyle
>> ------------------------------
>> *From:* mattdsincl...@gmail.com <mattdsincl...@gmail.com>
>> *Sent:* Friday, June 19, 2020 2:08 PM
>> *To:* Daniel Gerzhoy <daniel.gerz...@gmail.com>
>> *Cc:* GAURAV JAIN <gja...@wisc.edu>; Kyle Roarty <kroa...@wisc.edu>;
>> gem5 users mailing list <gem5-users@gem5.org>
>> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>>
>> Thanks Dan.  Kyle R. has found some issues with the patch that we're
>> testing, and updates may need to be pushed depending on those results.
>> Fingers crossed that fix will help you too.
>>
>> As Gaurav mentioned previously, the spin flag did not always solve the
>> problem for us -- seems like that is true for you too, although I don't
>> remember square ever failing for us.
>>
>> I don't know exactly where that PC is coming from, I'd have to get a
>> trace.  But I suspect it's actually a GPU address being accessed by some
>> instruction that's failing -- in the past when I've seen this kind of
>> issue, it was happening because the kernel boundary was not being respected
>> and code was running that shouldn't have been running yet.  I don't know
>> what your use case is, so it's possible that is not the issue for you -- a
>> trace would be the only way to know for sure.
>>
>> Matt
>>
>> Regards,
>> Matt Sinclair
>> Assistant Professor
>> University of Wisconsin-Madison
>> Computer Sciences Department
>> cs.wisc.edu/~sinclair
>>
>> On Wed, Jun 17, 2020 at 10:30 AM Daniel Gerzhoy <daniel.gerz...@gmail.com>
>> wrote:
>>
>> Hey Matt,
>>
>> Thanks for pushing those changes. I updated to the head of the AMD
>> staging branch and tried to run square. The time to get into main stays
>> about the same (5 min), FYI.
>>
>> But hipDeviceSynchronize() still fails even when I add
>> hipSetDeviceFlags(hipDeviceScheduleSpin), unfortunately.
>>
>>  panic: Tried to read unmapped address 0x1853e78.
>> PC: 0x7ffff52a966b, Instr:   MOV_R_M : ld   rdi, DS:[rbx + 0x8]
>>
>> Is that PC (0x7ffff52a966b) somewhere in the hip code or something?
>> ....oooor in the emulated driver? The line between the simulator code and
>> guest code is kind of blurry to me around there haha.
>>
>> Best,
>>
>> Dan
>>
>> On Mon, Jun 15, 2020 at 9:59 PM Matt Sinclair <mattdsincl...@gmail.com>
>> wrote:
>>
>> Hi Dan,
>>
>> Thanks for the update.  Apologies for the delay; the patch didn't apply
>> cleanly initially, but I have pushed the patch I promised previously.
>> Since I'm not sure if you're on the develop branch or the AMD staging
>> branch, I pushed it to both (there are some differences in code on the
>> branches, which I hope will be resolved over time as more of the commits
>> from the staging branch are pushed to develop):
>>
>> - develop: https://gem5-review.googlesource.com/c/public/gem5/+/30354
>> - AMD staging: https://gem5-review.googlesource.com/c/amd/gem5/+/30335
>>
>> I have validated that both of them compile, and asked Kyle R to test that
>> both of them a) don't break anything that is expected to work publicly with
>> the GPU and b) hopefully resolve some of the problems (like yours) with
>> barrier synchronization.  Let us know if this solves your problem too --
>> fingers crossed.
>>
>> Thanks,
>> Matt
>>
>> On Fri, Jun 12, 2020 at 2:47 PM Daniel Gerzhoy <daniel.gerz...@gmail.com>
>> wrote:
>>
>>   Matt,
>>
>> It wasn't so much a solution as an explanation. Kyle was running on an r5
>> 3600 (3.6-4.2 GHz), whereas I am on a Xeon Gold 5117 (2.0-2.8 GHz).
>>
>> The relative difference in clock speed seems to me a more reasonable
>> explanation for a slowdown from 1-1.5 minutes to ~5 min (actual time
>> before main) than for the 8 min (time before main + exit time) I was
>> seeing before.
>>
>> I'll update to the latest branch and see if that speeds me up further.
>> I'm also going to try running on a faster machine, though that will take
>> some setup time.
>>
>> Gaurav,
>>
>> Thanks for the tip, that will be helpful in the meantime.
>>
>> Dan
>>
>> On Fri, Jun 12, 2020 at 3:41 PM GAURAV JAIN <gja...@wisc.edu> wrote:
>>
>> Hi,
>>
>> I am not sure if chiming in now would cause any more confusion, but still
>> giving it a try.
>>
>> @Daniel Gerzhoy <daniel.gerz...@gmail.com> - for hipDeviceSynchronize,
>> as Matt mentioned, they are working on a fix and should have it out soon.
>> In the meantime, can you try this:
>>
>>     hipSetDeviceFlags(hipDeviceScheduleSpin);
>>     for (int k = 1; k < dim; k++) {
>>         hipLaunchKernelGGL(HIP_KERNEL_NAME(somekernel), grid, threads, 0, 0);
>>         hipDeviceSynchronize();
>>     }
>>
>> For me, in many cases (not all; in the ones where it didn't work, I got
>> the same unmapped-address error as you), this seemed to do the trick. You
>> should check out the HEAD and then try this. I am not expecting it to make
>> much of a difference, but it's still worth a shot.
>>
>>
>> ------------------------------
>> *From:* mattdsincl...@gmail.com <mattdsincl...@gmail.com>
>> *Sent:* Friday, June 12, 2020 2:14 PM
>> *To:* Daniel Gerzhoy <daniel.gerz...@gmail.com>
>> *Cc:* Kyle Roarty <kroa...@wisc.edu>; GAURAV JAIN <gja...@wisc.edu>;
>> gem5 users mailing list <gem5-users@gem5.org>
>> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>>
>> Hi Dan,
>>
>> Glad to hear things are working, and thanks for the tips!  I must admit
>> to not quite following what the solution was though -- are you saying the
>> solution is to replace exit(0)/return with m5_exit()?  I thought your
>> original post said the problem was things taking a really long time before
>> main?  If so, it would seem like something else must have been the
>> problem/solution?
>>
>> Coming to your other questions: I don't recall what exactly the root
>> cause of the hipDeviceSynchronize failure is, but I would definitely
>> recommend updating to the current staging branch head first and testing.  I
>> am also hoping to push a fix today to the barrier bit synchronization --
>> most of the hipDeviceSynchronize-type failures I've seen were due to a bug
>> in my barrier bit implementation.  I'm not sure if this will be the
>> solution to your problem or not, but I can definitely add you as a reviewer
>> and/or point you to it if needed.
>>
>> Not sure about the m5op, hopefully someone else can chime in on that.
>>
>> Thanks,
>> Matt
>>
>> On Fri, Jun 12, 2020 at 12:12 PM Daniel Gerzhoy <daniel.gerz...@gmail.com>
>> wrote:
>>
>> I've figured it out.
>>
>> To measure the time it took to get to main(), I put a *return 0;* at the
>> beginning of the function so I wouldn't have to babysit it.
>>
>> I didn't consider that it would also take some time for the simulator to
>> exit, which is where the extra few minutes come from.
>> Side-note: *m5_exit(0);* instead of a return exits immediately.
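>>
>> For reference, this is all that amounts to (a sketch; the m5ops header
>> name and how the m5 utility library gets linked depend on your gem5
>> checkout):
>>
>>     #include <cstdio>
>>
>>     #include "m5op.h"  // gem5's m5ops header; name/path may differ
>>
>>     int main()
>>     {
>>         printf("entered main\n");
>>         m5_exit(0);  // tell gem5 to end the simulation right away
>>         return 0;    // never reached when running under gem5
>>     }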
>>
>> 5 min is a bit more reasonable of a slowdown for the difference between
>> the two clocks.
>>
>> Two incidental things:
>>
>> 1. Is there a way to have gem5 spit out (real wall-clock) timestamps
>> while it's printing stuff?
>> 2. A while ago I asked about hipDeviceSynchronize() causing crashes
>> (panic: Tried to read unmapped address 0xff0000c29f48.). Has this been
>> fixed since?
>>
>> I'm going to update to the head of this branch soon, and eventually to
>> the main branch. If it hasn't been fixed, I've created a workaround:
>> stealing the completion signal of the kernel based on its launch id, and
>> manually waiting for it using the HSA interface.
>> Happy to help out and implement this as an m5op (or something) if that
>> would be helpful for you guys.
>>
>> Best,
>>
>> Dan
>>
>> On Thu, Jun 11, 2020 at 12:40 PM Matt Sinclair <mattdsincl...@gmail.com>
>> wrote:
>>
>> I don't see anything amazingly amiss in your output, but the number of
>> times the open/etc. calls fail is interesting -- Kyle, do we see the same
>> thing? If not, it could be that you should update your apu_se.py to point
>> to the "correct" place to search for the libraries first?
>>
>> Also, based on Kyle's reply, Dan, how long does it take you to boot up
>> square?  Certainly a slower machine might take longer, but it does seem
>> even slower than expected.  But if we're trying the same application, maybe
>> it will be easier to spot differences.
>>
>> I would also recommend updating to the latest commit on the staging
>> branch -- I don't believe it should break anything with those patches.
>>
>> Yes, looks like you are using the release version of ROCm -- no issues
>> there.
>>
>> Matt
>>
>>
>>
>> On Thu, Jun 11, 2020 at 9:38 AM Daniel Gerzhoy <daniel.gerz...@gmail.com>
>> wrote:
>>
>> I am using the docker, yeah.
>> It's running on our server cluster, which uses a Xeon Gold 5117 (2.0-2.8
>> GHz); that might make up some of the difference, since the r5 3600 has a
>> faster clock (3.6-4.2 GHz).
>>
>> I've hesitated to update my branch because the Dockerfile specifically
>> checks this branch out and applies a patch, though the patch isn't very
>> extensive.
>> This was from a while back (November, maybe?), and I know you guys have
>> been integrating things into the main branch (thanks!).
>> I was thinking I would wait until it's fully merged into the mainline
>> gem5 branch, then rebase onto that and merge my changes in.
>>
>> Last I checked, the GCN3 stuff is in the develop branch, not master, right?
>>
>> But if it will help, maybe I should update to the head of this branch.
>> Will I need to update the docker as well?
>>
>> As for debug vs. release ROCm, I think I'm using the release version.
>> This is what the Dockerfile built:
>>
>> ARG rocm_ver=1.6.2
>> RUN wget -qO- repo.radeon.com/rocm/archive/apt_${rocm_ver}.tar.bz2 \
>>     | tar -xjv \
>>     && cd apt_${rocm_ver}/pool/main/ \
>>     && dpkg -i h/hsakmt-roct-dev/* \
>>     && dpkg -i h/hsa-ext-rocr-dev/* \
>>     && dpkg -i h/hsa-rocr-dev/* \
>>     && dpkg -i r/rocm-utils/* \
>>     && dpkg -i h/hcc/* \
>>     && dpkg -i h/hip_base/* \
>>     && dpkg -i h/hip_hcc/* \
>>     && dpkg -i h/hip_samples/*
>>
>>
>> I ran a benchmark that prints that it entered main and returns
>> immediately; this took 9 minutes.
>> I've attached a debug trace with debug flags = "GPUDriver,SyscallVerbose".
>> There are a lot of weird things going on: "syscall open: failed", "syscall
>> brk: break point changed to [...]", and lots of ignored system calls.
>>
>> head of Stats for reference:
>> ---------- Begin Simulation Statistics ----------
>> sim_seconds                                  0.096192  # Number of seconds simulated
>> sim_ticks                                 96192368500  # Number of ticks simulated
>> final_tick                                96192368500  # Number of ticks from beginning of simulation (restored from checkpoints and never reset)
>> sim_freq                                 1000000000000  # Frequency of simulated ticks
>> host_inst_rate                                 175209  # Simulator instruction rate (inst/s)
>> host_op_rate                                   338409  # Simulator op (including micro ops) rate (op/s)
>> host_tick_rate                              175362515  # Simulator tick rate (ticks/s)
>> host_mem_usage                                1628608  # Number of bytes of host memory used
>> host_seconds                                   548.53  # Real time elapsed on the host
>> sim_insts                                    96108256  # Number of instructions simulated
>> sim_ops                                     185628785  # Number of ops (including micro ops) simulated
>> system.voltage_domain.voltage                       1  # Voltage in Volts
>> system.clk_domain.clock                          1000  # Clock period in ticks
>>
>> Maybe something in the attached file explains it better than I can
>> express.
>>
>> Many thanks for your help and hard work!
>>
>> Dan
>>
>>
>>
>>
>>
>> On Thu, Jun 11, 2020 at 3:32 AM Kyle Roarty <kroa...@wisc.edu> wrote:
>>
>> Running through a few applications using docker on an r5 3600, it took me
>> about 2.5 minutes or less each time to start executing the program.
>>
>> I ran square, dynamic_shared, and MatrixTranspose (All from HIP) which
>> took about 1-1.5 mins.
>>
>> I ran conv_bench and rnn_bench from DeepBench which took just about 2
>> minutes.
>>
>> Because of that, it's possible the size of the app has an effect on setup
>> time, as the HIP apps are extremely small.
>>
>> Also, the commit Dan is checked out on is d0945dc ("mem-ruby: add cache
>> hit/miss statistics for TCP and TCC",
>> https://gem5.googlesource.com/amd/gem5/+/d0945dc285cf146de160808d7e6d4c1fd3f73639),
>> which isn't the most recent commit. I don't believe that would account for
>> such a large slowdown, but it doesn't hurt to try the newest commit unless
>> it breaks something.
>>
>> Kyle
>> ------------------------------
>> *From:* mattdsincl...@gmail.com <mattdsincl...@gmail.com>
>> *Sent:* Thursday, June 11, 2020 1:15 AM
>> *To:* gem5 users mailing list <gem5-users@gem5.org>
>> *Cc:* Daniel Gerzhoy <daniel.gerz...@gmail.com>; GAURAV JAIN <
>> gja...@wisc.edu>; Kyle Roarty <kroa...@wisc.edu>
>> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>>
>> Gaurav & Kyle, do you know if this is the case?
>>
>> Dan, I believe the short answer is yes, although 7-8 minutes seems a
>> little long.  Are you running this in Kyle's Docker, or separately?  If in
>> the Docker, that does increase the overhead somewhat, so running it
>> directly on a system would likely reduce it.  Also, are you running with
>> the release or debug version of the ROCm drivers?  Again, the debug
>> version will likely add some time to this.
>>
>> Matt
>>
>> On Wed, Jun 10, 2020 at 2:00 PM Daniel Gerzhoy via gem5-users <
>> gem5-users@gem5.org> wrote:
>>
>> I've been running simulations using the GCN3 branch:
>>
>> rocm_ver=1.6.2
>> $ git branch
>> * (HEAD detached at d0945dc)
>>   agutierr/master-gcn3-staging
>>
>> And I've noticed that it takes roughly 7-8 minutes to get to main().
>>
>> I'm guessing that this is the simulator setting up drivers?
>> Is that correct? Is there other stuff going on?
>>
>> *Has anyone found a way to speed this up?*
>>
>> I am trying to get some of the rodinia benchmarks from the HIP-Examples
>> running, and debugging takes a long time as a result.
>>
>> I suspect that this is unavoidable but I won't know if I don't ask!
>>
>> Cheers,
>>
>> Dan Gerzhoy
>> _______________________________________________
>> gem5-users mailing list -- gem5-users@gem5.org
>> To unsubscribe send an email to gem5-users-le...@gem5.org
>>
>>
_______________________________________________
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org