Sounds good.

I'll generate a patch for the staging branch in the next few days.

Dan

On Mon, Jun 22, 2020 at 2:50 PM Matt Sinclair <mattdsincl...@gmail.com>
wrote:

> In my opinion, adding support for HSA_PACKET_TYPE_AGENT_DISPATCH,
> regardless of the current issues, is worthwhile and helpful to push.
>
> If you have a minimal working example of how you changed the benchmark,
> that would be helpful too.
>
> Kyle R. has spent a bunch of time trying to identify the source of the
> problem within the synchronize call, but thus far we haven't found anything
> concrete.  So for now, having this workaround would definitely be helpful
> for the community.
>
> Matt
>
> On Mon, Jun 22, 2020 at 1:25 PM Daniel Gerzhoy <daniel.gerz...@gmail.com>
> wrote:
>
>> Hey Matt,
>>
>> Happy to do that if you think it's viable, but I have to say my
>> workaround is pretty hacky. There are definitely some benchmark changes on
>> top of the changes to the simulator.
>>
>> Let me describe it for you and then if you still think it's a good idea
>> I'll make a patch.
>>
>> My workaround relies on the fact that:
>> 1. Launching a kernel sets up a completion signal that
>> hipDeviceSynchronize() ultimately waits on.
>> 2. All you need in the benchmark is that completion signal to know that
>> your kernel is complete.
>>
>> So I basically implement HSA_PACKET_TYPE_AGENT_DISPATCH handling in
>> hsa_packet_processor.cc and gpu_command_processor.cc to receive commands
>> from the benchmark directly.
>> One of the commands steals the completion signal for a particular
>> kernel and passes it back to the benchmark.
>>
>> After you launch the kernel (normally), you pass in that kernel's id (you
>> have to keep track of it yourself) and then send a command to steal the
>> completion signal. It gets passed back in the return_address member of the
>> agent packet.
>>
>> In the benchmark I store that signal and wait on it with
>> hsa_signal_wait_relaxed.
>> You have to do this every time you launch a kernel; you could
>> conceivably overload the hipDeviceSynchronize() function to do this for
>> you and keep track of kernel launches too.
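>>
>> To make that concrete, here is a minimal sketch of the benchmark side,
>> assuming you can get at the underlying HSA queue that HIP submits to. The
>> command encoding (the type value and the arg[] layout) is just whatever my
>> modified packet processor expects, so STEAL_SIGNAL_CMD and kernel_id are
>> placeholder names, and real code would also need to wait for the agent
>> packet itself to complete before reading the returned signal:
>>
>>     #include <hsa/hsa.h>
>>     #include <cstring>
>>
>>     // Hypothetical command id understood by the modified
>>     // gpu_command_processor.cc; not part of any real API.
>>     static const uint16_t STEAL_SIGNAL_CMD = 0x1234;
>>
>>     static hsa_signal_t steal_completion_signal(hsa_queue_t *queue,
>>                                                 uint64_t kernel_id)
>>     {
>>         hsa_signal_t stolen = {0};  // filled in via return_address
>>
>>         uint64_t idx = hsa_queue_add_write_index_relaxed(queue, 1);
>>         hsa_agent_dispatch_packet_t *pkt =
>>             (hsa_agent_dispatch_packet_t *)queue->base_address +
>>             (idx % queue->size);
>>
>>         memset(pkt, 0, sizeof(*pkt));
>>         pkt->type = STEAL_SIGNAL_CMD;   // application-defined command
>>         pkt->arg[0] = kernel_id;        // launch id the benchmark tracks
>>         pkt->return_address = &stolen;  // simulator writes the signal here
>>
>>         // Publish the packet: write the header last, then ring the doorbell.
>>         __atomic_store_n(&pkt->header,
>>                          HSA_PACKET_TYPE_AGENT_DISPATCH << HSA_PACKET_HEADER_TYPE,
>>                          __ATOMIC_RELEASE);
>>         hsa_signal_store_relaxed(queue->doorbell_signal, idx);
>>
>>         return stolen;
>>     }
>>
>> Then instead of calling hipDeviceSynchronize(), the benchmark waits on the
>> stolen signal directly:
>>
>>     hsa_signal_t sig = steal_completion_signal(queue, kernel_id);
>>     hsa_signal_wait_relaxed(sig, HSA_SIGNAL_CONDITION_LT, 1,
>>                             UINT64_MAX, HSA_WAIT_STATE_ACTIVE);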
>>
>> Let me know if you think this is still something you guys want.
>>
>> Cheers,
>>
>> Dan
>>
>>
>> On Mon, Jun 22, 2020 at 2:04 PM Matt Sinclair <mattdsincl...@gmail.com>
>> wrote:
>>
>>> Hi Dan,
>>>
>>> Do you mind pushing your workaround with the completion signal as a
>>> patch to the staging branch so we can take a look?  Or is this just a
>>> change to the program(s) themselves?
>>>
>>> After Kyle's fix (which has been pushed as an update to my patch), we're
>>> still seeing some hipDeviceSynchronize failures.  So we're interested in
>>> looking at what you did to see if it solves the problem.
>>>
>>> Matt
>>>
>>> On Fri, Jun 19, 2020 at 4:15 PM Kyle Roarty <kroa...@wisc.edu> wrote:
>>>
>>>> Hi Dan,
>>>>
>>>> Another thing to try is to add and set the environment variable HIP_DB
>>>> in apu_se.py (Line 461 for me; it's the line that starts with
>>>> "env = ['LD_LIBRARY_PATH..."), i.e., appending something like
>>>> 'HIP_DB=sync' to that env list.
>>>> Setting HIP_DB=sync or HIP_DB=api has prevented crashes on
>>>> hipDeviceSynchronize() calls in the applications I've tested.
>>>>
>>>> I traced through this issue (or at least one that manifests the
>>>> same way) a while back. What I remember is that the crash happens
>>>> somewhere in the HIP code, and it occurs because, somewhere much earlier,
>>>> we go down a codepath that doesn't clear a register (I believe that was
>>>> also in HIP code). That register then gets re-used until the error
>>>> propagates to the register used in the ld instruction. Unfortunately, I
>>>> had a hard time getting consistent, manageable traces, so I wasn't able
>>>> to figure out why we were going down the wrong codepath.
>>>>
>>>> Kyle
>>>> ------------------------------
>>>> *From:* mattdsincl...@gmail.com <mattdsincl...@gmail.com>
>>>> *Sent:* Friday, June 19, 2020 2:08 PM
>>>> *To:* Daniel Gerzhoy <daniel.gerz...@gmail.com>
>>>> *Cc:* GAURAV JAIN <gja...@wisc.edu>; Kyle Roarty <kroa...@wisc.edu>;
>>>> gem5 users mailing list <gem5-users@gem5.org>
>>>> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>>>>
>>>> Thanks Dan.  Kyle R. has found some things about the patch that we're
>>>> testing, and they may need to be pushed pending those results.  Fingers
>>>> crossed that fix will help you too.
>>>>
>>>> As Gaurav mentioned previously, the spin flag did not always solve the
>>>> problem for us -- seems like that is true for you too, although I don't
>>>> remember square ever failing for us.
>>>>
>>>> I don't know exactly where that PC is coming from; I'd have to get a
>>>> trace.  But I suspect it's actually a GPU address being accessed by some
>>>> instruction that's failing -- in the past when I've seen this kind of
>>>> issue, it was happening because the kernel boundary was not being respected
>>>> and code was running that shouldn't have been running yet.  I don't know
>>>> what your use case is, so it's possible that is not the issue for you -- a
>>>> trace would be the only way to know for sure.
>>>>
>>>> Matt
>>>>
>>>> Regards,
>>>> Matt Sinclair
>>>> Assistant Professor
>>>> University of Wisconsin-Madison
>>>> Computer Sciences Department
>>>> cs.wisc.edu/~sinclair
>>>>
>>>> On Wed, Jun 17, 2020 at 10:30 AM Daniel Gerzhoy <
>>>> daniel.gerz...@gmail.com> wrote:
>>>>
>>>> Hey Matt,
>>>>
>>>> Thanks for pushing those changes. I updated to the head of the AMD
>>>> staging branch and tried to run square. FYI, the time to get into main()
>>>> stays about the same (5 min).
>>>>
>>>> But hipDeviceSynchronize() still fails, unfortunately, even when I add
>>>> hipSetDeviceFlags(hipDeviceScheduleSpin):
>>>>
>>>>  panic: Tried to read unmapped address 0x1853e78.
>>>> PC: 0x7ffff52a966b, Instr:   MOV_R_M : ld   rdi, DS:[rbx + 0x8]
>>>>
>>>> Is that PC (0x7ffff52a966b) somewhere in the HIP code or something?
>>>> ...or in the emulated driver? The line between simulator code and
>>>> guest code is kind of blurry to me around there, haha.
>>>>
>>>> Best,
>>>>
>>>> Dan
>>>>
>>>> On Mon, Jun 15, 2020 at 9:59 PM Matt Sinclair <mattdsincl...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Dan,
>>>>
>>>> Thanks for the update.  Apologies for the delay; the patch didn't apply
>>>> cleanly initially, but I have pushed the patch I promised previously.
>>>> Since I'm not sure whether you're on the develop branch or the AMD staging
>>>> branch, I pushed it to both (there are some differences in code between the
>>>> branches, which I hope will be resolved over time as more of the commits
>>>> from the staging branch are pushed to develop):
>>>>
>>>> - develop: https://gem5-review.googlesource.com/c/public/gem5/+/30354
>>>> - AMD staging: https://gem5-review.googlesource.com/c/amd/gem5/+/30335
>>>>
>>>> I have validated that both of them compile, and asked Kyle R to test
>>>> that both of them a) don't break anything that is expected to work publicly
>>>> with the GPU and b) hopefully resolve some of the problems (like yours)
>>>> with barrier synchronization.  Let us know if this solves your problem too
>>>> -- fingers crossed.
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>> On Fri, Jun 12, 2020 at 2:47 PM Daniel Gerzhoy <
>>>> daniel.gerz...@gmail.com> wrote:
>>>>
>>>> Matt,
>>>>
>>>> It wasn't so much a solution as an explanation. Kyle was running on an
>>>> r5 3600 (3.6-4.2 GHz), whereas I am on a Xeon Gold 5117 (2.0-2.8 GHz).
>>>>
>>>> The relative difference in clock speed seems to me a more reasonable
>>>> explanation for a slowdown from 1-1.5 minutes to ~5 min (actual time
>>>> before main) than the 8 min (time before main + exit time) I was seeing
>>>> before.
>>>>
>>>> I'll update to the latest branch head and see if that speeds me up
>>>> further. I'm also going to try running on a faster machine, though that
>>>> will take some setup time.
>>>>
>>>> Gaurav,
>>>>
>>>> Thanks for the tip, that will be helpful in the meantime.
>>>>
>>>> Dan
>>>>
>>>> On Fri, Jun 12, 2020 at 3:41 PM GAURAV JAIN <gja...@wisc.edu> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am not sure whether chiming in now will cause more confusion, but I'll
>>>> give it a try.
>>>>
>>>> @Daniel Gerzhoy <daniel.gerz...@gmail.com> - for hipDeviceSynchronize,
>>>> as Matt mentioned, they are working on a fix and should have it out soon.
>>>> In the meantime, can you try this:
>>>>
>>>>     hipSetDeviceFlags(hipDeviceScheduleSpin);
>>>>     for (int k = 1; k < dim; k++) {
>>>>         hipLaunchKernelGGL(HIP_KERNEL_NAME(somekernel), grid, threads,
>>>>                            0, 0);
>>>>         hipDeviceSynchronize();
>>>>     }
>>>>
>>>> For me, in many cases (not all; in the ones where it didn't work, I
>>>> got the same unmapped-address error as you), this seemed to do the
>>>> trick. You should check out the HEAD and then try this. I am not
>>>> expecting it to make much difference, but it's still worth a shot.
>>>>
>>>>
>>>> ------------------------------
>>>> *From:* mattdsincl...@gmail.com <mattdsincl...@gmail.com>
>>>> *Sent:* Friday, June 12, 2020 2:14 PM
>>>> *To:* Daniel Gerzhoy <daniel.gerz...@gmail.com>
>>>> *Cc:* Kyle Roarty <kroa...@wisc.edu>; GAURAV JAIN <gja...@wisc.edu>;
>>>> gem5 users mailing list <gem5-users@gem5.org>
>>>> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>>>>
>>>> Hi Dan,
>>>>
>>>> Glad to hear things are working, and thanks for the tips!  I must admit
>>>> I don't quite follow what the solution was, though -- are you saying the
>>>> solution is to replace exit(0)/return with m5_exit()?  I thought your
>>>> original post said the problem was things taking a really long time before
>>>> main?  If so, it would seem like something else must have been the
>>>> problem/solution?
>>>>
>>>> Coming to your other questions: I don't recall exactly what the root
>>>> cause of the hipDeviceSynchronize failure is, but I would definitely
>>>> recommend updating to the current staging branch head first and testing.  I
>>>> am also hoping to push a fix today for the barrier bit synchronization --
>>>> most of the hipDeviceSynchronize-type failures I've seen were due to a bug
>>>> in my barrier bit implementation.  I'm not sure if this will be the
>>>> solution to your problem or not, but I can definitely add you as a reviewer
>>>> and/or point you to it if needed.
>>>>
>>>> Not sure about the m5op, hopefully someone else can chime in on that.
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>> On Fri, Jun 12, 2020 at 12:12 PM Daniel Gerzhoy <
>>>> daniel.gerz...@gmail.com> wrote:
>>>>
>>>> I've figured it out.
>>>>
>>>> To measure the time it took to get to main(), I put a *return 0;* at
>>>> the beginning of the function so I wouldn't have to babysit it.
>>>>
>>>> I didn't consider that it would also take some time for the simulator
>>>> to exit, which is where the extra few minutes come from.
>>>> Side note: calling *m5_exit(0);* instead of returning exits immediately.
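>>>>
>>>> For reference, a minimal sketch of that early-exit trick, assuming the
>>>> benchmark links against gem5's m5ops library (the header name and
>>>> location vary across gem5 versions):
>>>>
>>>>     #include <gem5/m5ops.h>  // older trees use "m5op.h" from util/m5
>>>>
>>>>     int main(int argc, char **argv)
>>>>     {
>>>>         // We only want to measure simulated start-up cost, so end the
>>>>         // simulation the moment main() is reached, skipping teardown.
>>>>         m5_exit(0);
>>>>         return 0;  // never reached
>>>>     }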
>>>>
>>>> 5 min is a more reasonable slowdown given the difference between
>>>> the two clock speeds.
>>>>
>>>> Two incidental things:
>>>>
>>>> 1. Is there a way to have gem5 spit out (real wall-clock) timestamps
>>>> while it's printing stuff?
>>>> 2. A while ago I asked about hipDeviceSynchronize() causing crashes
>>>> (panic: Tried to read unmapped address 0xff0000c29f48.). Has this been
>>>> fixed since?
>>>>
>>>> I'm going to update to the head of this branch soon, and eventually to
>>>> the main branch. In case it hasn't been fixed, I've created a workaround:
>>>> stealing the kernel's completion signal based on its launch id and
>>>> manually waiting for it using the HSA interface.
>>>> Happy to help out and implement this as an m5op (or something) if that
>>>> would be helpful for you guys.
>>>>
>>>> Best,
>>>>
>>>> Dan
>>>>
>>>> On Thu, Jun 11, 2020 at 12:40 PM Matt Sinclair <mattdsincl...@gmail.com>
>>>> wrote:
>>>>
>>>> I don't see anything especially amiss in your output, but the number of
>>>> times open/etc. fail is interesting -- Kyle, do we see the same thing?
>>>> If not, you may need to update your apu_se.py to point to the
>>>> "correct" place to search for the libraries first.
>>>>
>>>> Also, based on Kyle's reply, Dan, how long does it take you to boot up
>>>> square?  Certainly a slower machine might take longer, but it does seem
>>>> even slower than expected.  But if we're trying the same application, maybe
>>>> it will be easier to spot differences.
>>>>
>>>> I would also recommend updating to the latest commit on the staging
>>>> branch -- I don't believe it should break anything with those patches.
>>>>
>>>> Yes, looks like you are using the release version of ROCm -- no issues
>>>> there.
>>>>
>>>> Matt
>>>>
>>>>
>>>>
>>>> On Thu, Jun 11, 2020 at 9:38 AM Daniel Gerzhoy <
>>>> daniel.gerz...@gmail.com> wrote:
>>>>
>>>> I am using the Docker, yeah.
>>>> It's running on our server cluster, which uses a Xeon Gold 5117 (2.0-2.8
>>>> GHz); that might account for some of the difference, since the r5 3600
>>>> has a faster clock (3.6-4.2 GHz).
>>>>
>>>> I've hesitated to update my branch because the Dockerfile
>>>> specifically checks this branch out and applies a patch, though the patch
>>>> isn't very extensive.
>>>> This was from a while back (November, maybe?), and I know you guys have
>>>> been integrating things into the main branch (thanks!).
>>>> I was thinking I would wait until it's fully merged into the mainline
>>>> gem5 branch, then rebase onto that and try to merge my changes in.
>>>>
>>>> Last I checked, the GCN3 stuff is in the develop branch, not master, right?
>>>>
>>>> But if it will help, maybe I should update to the head of this branch.
>>>> Will I need to update the Docker as well?
>>>>
>>>> As for debug vs. release ROCm, I think I'm using the release version.
>>>> This is what the Dockerfile built:
>>>>
>>>> ARG rocm_ver=1.6.2
>>>> RUN wget -qO- repo.radeon.com/rocm/archive/apt_${rocm_ver}.tar.bz2 \
>>>>     | tar -xjv \
>>>>     && cd apt_${rocm_ver}/pool/main/ \
>>>>     && dpkg -i h/hsakmt-roct-dev/* \
>>>>     && dpkg -i h/hsa-ext-rocr-dev/* \
>>>>     && dpkg -i h/hsa-rocr-dev/* \
>>>>     && dpkg -i r/rocm-utils/* \
>>>>     && dpkg -i h/hcc/* \
>>>>     && dpkg -i h/hip_base/* \
>>>>     && dpkg -i h/hip_hcc/* \
>>>>     && dpkg -i h/hip_samples/*
>>>>
>>>>
>>>> I ran a benchmark that prints that it entered main and then returns
>>>> immediately; this took 9 minutes.
>>>> I've attached a debug trace taken with debug flags
>>>> "GPUDriver,SyscallVerbose".
>>>> There are a lot of weird things going on: "syscall open: failed",
>>>> "syscall brk: break point changed to [...]", and lots of ignored system
>>>> calls.
>>>>
>>>> Head of the stats file, for reference:
>>>> ---------- Begin Simulation Statistics ----------
>>>> sim_seconds                          0.096192   # Number of seconds simulated
>>>> sim_ticks                         96192368500   # Number of ticks simulated
>>>> final_tick                        96192368500   # Number of ticks from beginning of simulation (restored from checkpoints and never reset)
>>>> sim_freq                        1000000000000   # Frequency of simulated ticks
>>>> host_inst_rate                         175209   # Simulator instruction rate (inst/s)
>>>> host_op_rate                           338409   # Simulator op (including micro ops) rate (op/s)
>>>> host_tick_rate                      175362515   # Simulator tick rate (ticks/s)
>>>> host_mem_usage                        1628608   # Number of bytes of host memory used
>>>> host_seconds                           548.53   # Real time elapsed on the host
>>>> sim_insts                            96108256   # Number of instructions simulated
>>>> sim_ops                             185628785   # Number of ops (including micro ops) simulated
>>>> system.voltage_domain.voltage               1   # Voltage in Volts
>>>> system.clk_domain.clock                  1000   # Clock period in ticks
>>>>
>>>> Maybe something in the attached file explains it better than I can
>>>> express.
>>>>
>>>> Many thanks for your help and hard work!
>>>>
>>>> Dan
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jun 11, 2020 at 3:32 AM Kyle Roarty <kroa...@wisc.edu> wrote:
>>>>
>>>> Running through a few applications, it took me about 2.5 minutes or
>>>> less each time to start executing the program using Docker on an r5 3600.
>>>>
>>>> I ran square, dynamic_shared, and MatrixTranspose (all from HIP), which
>>>> took about 1-1.5 minutes.
>>>>
>>>> I ran conv_bench and rnn_bench from DeepBench, which took just about 2
>>>> minutes.
>>>>
>>>> Because of that, it's possible the size of the app has an effect on
>>>> setup time, as the HIP apps are extremely small.
>>>>
>>>> Also, the commit Dan is checked out on is d0945dc ("mem-ruby: add cache
>>>> hit/miss statistics for TCP and TCC",
>>>> https://gem5.googlesource.com/amd/gem5/+/d0945dc285cf146de160808d7e6d4c1fd3f73639),
>>>> which isn't the most recent commit. I don't believe that would account
>>>> for such a large slowdown, but it doesn't hurt to try the newest commit
>>>> unless it breaks something.
>>>>
>>>> Kyle
>>>> ------------------------------
>>>> *From:* mattdsincl...@gmail.com <mattdsincl...@gmail.com>
>>>> *Sent:* Thursday, June 11, 2020 1:15 AM
>>>> *To:* gem5 users mailing list <gem5-users@gem5.org>
>>>> *Cc:* Daniel Gerzhoy <daniel.gerz...@gmail.com>; GAURAV JAIN <
>>>> gja...@wisc.edu>; Kyle Roarty <kroa...@wisc.edu>
>>>> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>>>>
>>>> Gaurav & Kyle, do you know if this is the case?
>>>>
>>>> Dan, I believe the short answer is yes, although 7-8 minutes seems a
>>>> little long.  Are you running this in Kyle's Docker, or separately?  If in
>>>> the Docker, that does add some overhead, so running directly on a system
>>>> would likely be somewhat faster.  Also, are you running with the release
>>>> or debug version of the ROCm drivers?  Again, the debug version will
>>>> likely add some time.
>>>>
>>>> Matt
>>>>
>>>> On Wed, Jun 10, 2020 at 2:00 PM Daniel Gerzhoy via gem5-users <
>>>> gem5-users@gem5.org> wrote:
>>>>
>>>> I've been running simulations using the GCN3 branch:
>>>>
>>>>     rocm_ver=1.6.2
>>>>     $ git branch
>>>>     * (HEAD detached at d0945dc)
>>>>       agutierr/master-gcn3-staging
>>>>
>>>> And I've noticed that it takes roughly 7-8 minutes to get to main().
>>>>
>>>> I'm guessing that this is the simulator setting up drivers?
>>>> Is that correct? Is there other stuff going on?
>>>>
>>>> *Has anyone found a way to speed this up?*
>>>>
>>>> I am trying to get some of the Rodinia benchmarks from HIP-Examples
>>>> running, and debugging takes a long time as a result.
>>>>
>>>> I suspect that this is unavoidable, but I won't know if I don't ask!
>>>>
>>>> Cheers,
>>>>
>>>> Dan Gerzhoy