Sounds good. I'll generate a patch for the staging branch in the next few days.
Dan

On Mon, Jun 22, 2020 at 2:50 PM Matt Sinclair <mattdsincl...@gmail.com> wrote:
> In my opinion, adding support for HSA_PACKET_TYPE_AGENT_DISPATCH,
> irrespective of the current issues, is worthwhile and helpful to push.
>
> If you have a minimum working example of how you changed the benchmark,
> that would be helpful too.
>
> Kyle R. has spent a bunch of time trying to identify the source of the
> problem within the synchronize call, but thus far we haven't found anything
> concrete. So for now, having this workaround would definitely be helpful
> for the community.
>
> Matt
>
> On Mon, Jun 22, 2020 at 1:25 PM Daniel Gerzhoy <daniel.gerz...@gmail.com> wrote:
>
>> Hey Matt,
>>
>> Happy to do that if you think it's viable, but I have to say my
>> workaround is pretty hack-y. There are definitely some benchmark changes on
>> top of the changes to the simulator.
>>
>> Let me describe it for you, and if you still think it's a good idea
>> I'll make a patch.
>>
>> My workaround relies on two facts:
>> 1. Launching a kernel sets up a completion signal that
>> hipDeviceSynchronize() ultimately waits on.
>> 2. All the benchmark needs is that completion signal to know that
>> the kernel is complete.
>>
>> So I basically implement HSA_PACKET_TYPE_AGENT_DISPATCH in
>> hsa_packet_processor.cc and gpu_command_processor.cc to receive commands
>> from the benchmark directly.
>> One of the commands steals the completion signal for a particular
>> kernel and passes it back to the benchmark.
>>
>> After you launch the kernel (normally), you pass in that kernel's id (you
>> have to keep track) and send a command to steal the completion signal. It
>> gets passed back in the return_address member of the agent packet.
>>
>> In the benchmark I store that signal and wait on it with
>> hsa_signal_wait_relaxed.
>> You have to do this every time you launch a kernel; you could
>> conceivably overload hipDeviceSynchronize() to do this for
>> you and keep track of kernel launches too.
>>
>> Let me know if you think this is still something you guys want.
>>
>> Cheers,
>>
>> Dan
>>
>> On Mon, Jun 22, 2020 at 2:04 PM Matt Sinclair <mattdsincl...@gmail.com> wrote:
>>
>>> Hi Dan,
>>>
>>> Do you mind pushing your workaround with the completion signal as a
>>> patch to the staging branch so we can take a look? Or is this just a
>>> change to the program(s) itself?
>>>
>>> After Kyle's fix (which has been pushed as an update to my patch), we're
>>> still seeing some hipDeviceSynchronize failures. So we're interested in
>>> looking at what you did to see if it solves the problem.
>>>
>>> Matt
>>>
>>> On Fri, Jun 19, 2020 at 4:15 PM Kyle Roarty <kroa...@wisc.edu> wrote:
>>>
>>>> Hi Dan,
>>>>
>>>> Another thing to try is to add and set the environment variable HIP_DB
>>>> in apu_se.py (line 461 for me; it starts with "env = ['LD_LIBRARY_PATH...").
>>>> Setting HIP_DB=sync or HIP_DB=api has prevented crashes on
>>>> hipDeviceSynchronize() calls for the applications I've tested.
>>>>
>>>> I traced through this issue (or at least one that manifests the
>>>> same way) a while back. What I remember is that the crash happens
>>>> somewhere in the HIP code, and it occurs because, somewhere much earlier,
>>>> we go down a codepath that doesn't clear a register (I believe that was also
>>>> in HIP code). That register then gets re-used until the error propagates to
>>>> the register used in the ld instruction. Unfortunately, I had a hard time
>>>> getting consistent, manageable traces, so I wasn't able to figure out
>>>> why we were going down the wrong codepath.
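[Editor's sketch] Kyle's HIP_DB suggestion amounts to a one-line change to the environment list that apu_se.py hands to the simulated process. A minimal illustration of that list is below; the exact line number, surrounding contents, and library paths vary by checkout, and the paths here are placeholders, not the real values from apu_se.py:

```python
# Environment passed to the simulated ROCm process in gem5's
# configs/example/apu_se.py (placeholder paths). Adding HIP_DB=sync
# turns on HIP's synchronization debug output, which reportedly
# avoids the hipDeviceSynchronize() crash for some applications.
rocm_lib_dirs = ["/opt/rocm/lib", "/opt/rocm/hsa/lib"]  # placeholder paths
env = [
    "LD_LIBRARY_PATH=%s" % ":".join(rocm_lib_dirs),
    "HIP_DB=sync",  # or "HIP_DB=api" for full API tracing
]
```

The same list could carry any other HIP debug knobs; only HIP_DB=sync and HIP_DB=api are the ones Kyle reports testing.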
>>>>
>>>> Kyle
>>>> ------------------------------
>>>> *From:* mattdsincl...@gmail.com <mattdsincl...@gmail.com>
>>>> *Sent:* Friday, June 19, 2020 2:08 PM
>>>> *To:* Daniel Gerzhoy <daniel.gerz...@gmail.com>
>>>> *Cc:* GAURAV JAIN <gja...@wisc.edu>; Kyle Roarty <kroa...@wisc.edu>; gem5 users mailing list <gem5-users@gem5.org>
>>>> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>>>>
>>>> Thanks Dan. Kyle R. has found some things about the patch that we're
>>>> testing, and updates may need to be pushed pending those results. Fingers
>>>> crossed that fix will help you too.
>>>>
>>>> As Gaurav mentioned previously, the spin flag did not always solve the
>>>> problem for us -- it seems like that is true for you too, although I don't
>>>> remember square ever failing for us.
>>>>
>>>> I don't know exactly where that PC is coming from; I'd have to get a
>>>> trace. But I suspect it's actually a GPU address being accessed by some
>>>> instruction that's failing -- in the past when I've seen this kind of
>>>> issue, it was happening because the kernel boundary was not being respected
>>>> and code was running that shouldn't have been running yet. I don't know
>>>> what your use case is, so it's possible that is not the issue for you -- a
>>>> trace would be the only way to know for sure.
>>>>
>>>> Matt
>>>>
>>>> Regards,
>>>> Matt Sinclair
>>>> Assistant Professor
>>>> University of Wisconsin-Madison
>>>> Computer Sciences Department
>>>> cs.wisc.edu/~sinclair
>>>>
>>>> On Wed, Jun 17, 2020 at 10:30 AM Daniel Gerzhoy <daniel.gerz...@gmail.com> wrote:
>>>>
>>>> Hey Matt,
>>>>
>>>> Thanks for pushing those changes. I updated to the head of the amd staging
>>>> branch and tried to run square. The time to get into main stays about the
>>>> same (5 min), FYI.
>>>>
>>>> But hipDeviceSynchronize() still fails, even when I add
>>>> hipSetDeviceFlags(hipDeviceScheduleSpin);,
>>>> unfortunately.
>>>>
>>>> panic: Tried to read unmapped address 0x1853e78.
>>>> PC: 0x7ffff52a966b, Instr: MOV_R_M : ld rdi, DS:[rbx + 0x8]
>>>>
>>>> Is that PC (0x7ffff52a966b) somewhere in the HIP code or something?
>>>> ...oooor in the emulated driver? The line between the simulator code and
>>>> guest code is kind of blurry to me around there, haha.
>>>>
>>>> Best,
>>>>
>>>> Dan
>>>>
>>>> On Mon, Jun 15, 2020 at 9:59 PM Matt Sinclair <mattdsincl...@gmail.com> wrote:
>>>>
>>>> Hi Dan,
>>>>
>>>> Thanks for the update. Apologies for the delay -- the patch didn't apply
>>>> cleanly initially, but I have pushed the patch I promised previously.
>>>> Since I'm not sure if you're on the develop branch or the AMD staging
>>>> branch, I pushed it to both (there are some differences in code between the
>>>> branches, which I hope will be resolved over time as more of the commits
>>>> from the staging branch are pushed to develop):
>>>>
>>>> - develop: https://gem5-review.googlesource.com/c/public/gem5/+/30354
>>>> - AMD staging: https://gem5-review.googlesource.com/c/amd/gem5/+/30335
>>>>
>>>> I have validated that both of them compile, and asked Kyle R. to test
>>>> that both of them a) don't break anything that is expected to work publicly
>>>> with the GPU and b) hopefully resolve some of the problems (like yours)
>>>> with barrier synchronization. Let us know if this solves your problem too
>>>> -- fingers crossed.
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>> On Fri, Jun 12, 2020 at 2:47 PM Daniel Gerzhoy <daniel.gerz...@gmail.com> wrote:
>>>>
>>>> Matt,
>>>>
>>>> It wasn't so much a solution as an explanation. Kyle was running on an
>>>> R5 3600 (3.6-4.2 GHz), whereas I am on a Xeon Gold 5117 (2.0-2.8 GHz).
>>>>
>>>> The relative difference in clock speed seems to me a more reasonable
>>>> explanation for a slowdown from 1-1.5 minutes to ~5 min (actual time before
>>>> main) than the 8 min (time before main + exit time) I was seeing before.
>>>>
>>>> I'll update to the latest branch and see if that speeds me up further.
>>>> I'm also going to try running on a faster machine, though that will
>>>> take some setup time.
>>>>
>>>> Gaurav,
>>>>
>>>> Thanks for the tip; that will be helpful in the meantime.
>>>>
>>>> Dan
>>>>
>>>> On Fri, Jun 12, 2020 at 3:41 PM GAURAV JAIN <gja...@wisc.edu> wrote:
>>>>
>>>> Hi,
>>>>
>>>> I am not sure if chiming in now would cause any more confusion, but I'm
>>>> giving it a try anyway.
>>>>
>>>> @Daniel Gerzhoy <daniel.gerz...@gmail.com> - for hipDeviceSynchronize,
>>>> as Matt mentioned, they are working on a fix and should have it out soon.
>>>> In the meantime, can you try this:
>>>>
>>>> hipSetDeviceFlags(hipDeviceScheduleSpin);
>>>> for (int k = 1; k < dim; k++) {
>>>>     hipLaunchKernelGGL(HIP_KERNEL_NAME(somekernel), grid, threads, 0, 0);
>>>>     hipDeviceSynchronize();
>>>> }
>>>>
>>>> For me, in many cases (not all; in the ones where it didn't work, I
>>>> got the same unmapped-address error as you), this seemed to do the
>>>> trick. You should check out the HEAD and then try this. I am not expecting
>>>> it to make much difference, but it's still worth a shot.
>>>>
>>>> ------------------------------
>>>> *From:* mattdsincl...@gmail.com <mattdsincl...@gmail.com>
>>>> *Sent:* Friday, June 12, 2020 2:14 PM
>>>> *To:* Daniel Gerzhoy <daniel.gerz...@gmail.com>
>>>> *Cc:* Kyle Roarty <kroa...@wisc.edu>; GAURAV JAIN <gja...@wisc.edu>; gem5 users mailing list <gem5-users@gem5.org>
>>>> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>>>>
>>>> Hi Dan,
>>>>
>>>> Glad to hear things are working, and thanks for the tips! I must admit
>>>> to not quite following what the solution was, though -- are you saying the
>>>> solution is to replace exit(0)/return with m5_exit()? I thought your
>>>> original post said the problem was things taking a really long time before
>>>> main? If so, it would seem like something else must have been the
>>>> problem/solution?
>>>>
>>>> Coming to your other questions: I don't recall what exactly the root
>>>> cause of the hipDeviceSynchronize failure is, but I would definitely
>>>> recommend updating to the current staging branch head first and testing. I
>>>> am also hoping to push a fix today for the barrier bit synchronization --
>>>> most of the hipDeviceSynchronize-type failures I've seen were due to a bug
>>>> in my barrier bit implementation. I'm not sure whether this will solve
>>>> your problem, but I can definitely add you as a reviewer and/or point you
>>>> to it if needed.
>>>>
>>>> Not sure about the m5op; hopefully someone else can chime in on that.
>>>>
>>>> Thanks,
>>>> Matt
>>>>
>>>> On Fri, Jun 12, 2020 at 12:12 PM Daniel Gerzhoy <daniel.gerz...@gmail.com> wrote:
>>>>
>>>> I've figured it out.
>>>>
>>>> To measure the time it took to get to main(), I put a *return 0;* at
>>>> the beginning of the function so I wouldn't have to babysit it.
>>>>
>>>> I didn't consider that it would also take some time for the simulator
>>>> to exit, which is where the extra few minutes come from.
>>>> Side note: *m5_exit(0);* instead of a return exits immediately.
>>>>
>>>> 5 min is a more reasonable slowdown given the difference between
>>>> the two clocks.
>>>>
>>>> Two incidental things:
>>>>
>>>> 1. Is there a way to have gem5 print (real wall-clock) timestamps
>>>> while it's printing stuff?
>>>> 2. A while ago I asked about hipDeviceSynchronize() causing crashes
>>>> (panic: Tried to read unmapped address 0xff0000c29f48.). Has this been
>>>> fixed since?
>>>>
>>>> I'm going to update to the head of this branch soon, and eventually to
>>>> the main branch. If it hasn't been fixed, I've created a workaround by
>>>> stealing the completion signal of the kernel based on its launch id and
>>>> manually waiting on it using the HSA interface.
>>>> Happy to help out and implement this as an m5op (or something) if that
>>>> would be helpful for you guys.
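[Editor's sketch] The bookkeeping Dan's workaround needs can be modeled in a few lines. This is a rough Python model only -- all names are hypothetical; in the real workaround this logic lives in C++ in the benchmark, the "steal" command is an HSA_PACKET_TYPE_AGENT_DISPATCH packet answered via the packet's return_address, and the wait is hsa_signal_wait_relaxed:

```python
class KernelTracker:
    """Models tracking kernel launch ids so each kernel's completion
    signal can be stolen from the simulator and waited on directly."""

    def __init__(self):
        self.next_id = 0
        self.signals = {}  # launch id -> completion signal handle

    def launch(self, launch_kernel_fn):
        # Launch the kernel normally (hipLaunchKernelGGL in the real
        # code), then remember its launch id for later.
        launch_kernel_fn()
        launch_id = self.next_id
        self.next_id += 1
        return launch_id

    def steal_signal(self, launch_id, send_agent_cmd_fn):
        # Send the agent-dispatch command carrying the launch id; the
        # patched packet processor hands back that kernel's completion
        # signal (via return_address in the real implementation).
        self.signals[launch_id] = send_agent_cmd_fn(launch_id)
        return self.signals[launch_id]

    def synchronize(self, launch_id, wait_fn):
        # Stand-in for hsa_signal_wait_relaxed on the stolen signal.
        wait_fn(self.signals[launch_id])


# Usage with stand-in callables for the launch, agent command, and wait:
tracker = KernelTracker()
lid = tracker.launch(lambda: None)
sig = tracker.steal_signal(lid, lambda i: ("signal", i))
tracker.synchronize(lid, lambda s: None)
```

As Dan notes, an overloaded hipDeviceSynchronize() could fold the steal-and-wait steps into one call by keeping this table internally.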
>>>>
>>>> Best,
>>>>
>>>> Dan
>>>>
>>>> On Thu, Jun 11, 2020 at 12:40 PM Matt Sinclair <mattdsincl...@gmail.com> wrote:
>>>>
>>>> I don't see anything amazingly amiss in your output, but the number of
>>>> times the open/etc. calls fail is interesting -- Kyle, do we see the same
>>>> thing? If not, it could be that you should update your apu_se.py to point
>>>> to the "correct" place to search for the libraries first.
>>>>
>>>> Also, based on Kyle's reply, Dan, how long does it take you to boot up
>>>> square? Certainly a slower machine might take longer, but it does seem
>>>> even slower than expected. If we're trying the same application, maybe
>>>> it will be easier to spot differences.
>>>>
>>>> I would also recommend updating to the latest commit on the staging
>>>> branch -- I don't believe it should break anything with those patches.
>>>>
>>>> Yes, it looks like you are using the release version of ROCm -- no
>>>> issues there.
>>>>
>>>> Matt
>>>>
>>>> On Thu, Jun 11, 2020 at 9:38 AM Daniel Gerzhoy <daniel.gerz...@gmail.com> wrote:
>>>>
>>>> I am using the docker, yeah.
>>>> It's running on our server cluster, which is a Xeon Gold 5117 (2.0-2.8
>>>> GHz); that might make up some of the difference, since the R5 3600 has a
>>>> faster clock (3.6-4.2 GHz).
>>>>
>>>> I've hesitated to update my branch because the Dockerfile specifically
>>>> checks this branch out and applies a patch, though the patch isn't very
>>>> extensive.
>>>> This was from a while back (November, maybe?), and I know you guys have
>>>> been integrating things into the main branch (thanks!).
>>>> I was thinking I would wait until it's fully merged into the mainline
>>>> gem5 branch, then rebase onto that and try to merge my changes in.
>>>>
>>>> Last I checked, the GCN3 stuff is in the develop branch, not master, right?
>>>>
>>>> But if it will help, maybe I should update to the head of this branch.
>>>> Will I need to update the docker as well?
>>>> As for debug vs. release ROCm, I think I'm using the release version.
>>>> This is what the Dockerfile built:
>>>>
>>>> ARG rocm_ver=1.6.2
>>>> RUN wget -qO- repo.radeon.com/rocm/archive/apt_${rocm_ver}.tar.bz2 \
>>>>     | tar -xjv \
>>>>     && cd apt_${rocm_ver}/pool/main/ \
>>>>     && dpkg -i h/hsakmt-roct-dev/* \
>>>>     && dpkg -i h/hsa-ext-rocr-dev/* \
>>>>     && dpkg -i h/hsa-rocr-dev/* \
>>>>     && dpkg -i r/rocm-utils/* \
>>>>     && dpkg -i h/hcc/* \
>>>>     && dpkg -i h/hip_base/* \
>>>>     && dpkg -i h/hip_hcc/* \
>>>>     && dpkg -i h/hip_samples/*
>>>>
>>>> I ran a benchmark that prints that it entered main and returns
>>>> immediately; this took 9 minutes.
>>>> I've attached a debug trace with debug flags = "GPUDriver,SyscallVerbose".
>>>> There's a lot of weird stuff going on: "syscall open: failed", "syscall
>>>> brk: break point changed to [...]", and lots of ignored system calls.
>>>>
>>>> Head of stats for reference:
>>>>
>>>> ---------- Begin Simulation Statistics ----------
>>>> sim_seconds                          0.096192       # Number of seconds simulated
>>>> sim_ticks                         96192368500       # Number of ticks simulated
>>>> final_tick                        96192368500       # Number of ticks from beginning of simulation (restored from checkpoints and never reset)
>>>> sim_freq                        1000000000000       # Frequency of simulated ticks
>>>> host_inst_rate                         175209       # Simulator instruction rate (inst/s)
>>>> host_op_rate                           338409       # Simulator op (including micro ops) rate (op/s)
>>>> host_tick_rate                      175362515       # Simulator tick rate (ticks/s)
>>>> host_mem_usage                        1628608       # Number of bytes of host memory used
>>>> host_seconds                           548.53       # Real time elapsed on the host
>>>> sim_insts                            96108256       # Number of instructions simulated
>>>> sim_ops                             185628785       # Number of ops (including micro ops) simulated
>>>> system.voltage_domain.voltage               1       # Voltage in Volts
>>>> system.clk_domain.clock                  1000       # Clock period in ticks
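[Editor's sketch] Those stats are internally consistent, which is a quick sanity check on the run: sim_seconds is sim_ticks divided by sim_freq, and host_tick_rate is approximately sim_ticks over host_seconds (the ~9 minutes of wall time Dan mentions). A few lines of arithmetic using the numbers quoted above:

```python
# Values copied from the stats block above.
sim_ticks = 96_192_368_500       # simulated ticks
sim_freq = 1_000_000_000_000     # ticks per simulated second (1 THz)
host_seconds = 548.53            # real wall time on the host (~9.1 min)

sim_seconds = sim_ticks / sim_freq          # matches the reported 0.096192
host_tick_rate = sim_ticks / host_seconds   # ~1.75e8, matches reported 175362515
wall_minutes = host_seconds / 60            # ~9.1, the "9 minutes" in the email

print(round(sim_seconds, 6), int(host_tick_rate), round(wall_minutes, 1))
```

So the simulator spent about nine wall-clock minutes to simulate under a tenth of a simulated second, which is the slowdown being discussed.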
>>>>
>>>> Maybe something in the attached file explains it better than I can
>>>> express.
>>>>
>>>> Many thanks for your help and hard work!
>>>>
>>>> Dan
>>>>
>>>> On Thu, Jun 11, 2020 at 3:32 AM Kyle Roarty <kroa...@wisc.edu> wrote:
>>>>
>>>> Running through a few applications, it took me about 2.5 minutes or
>>>> less each time using docker to start executing the program on an R5 3600.
>>>>
>>>> I ran square, dynamic_shared, and MatrixTranspose (all from HIP), which
>>>> took about 1-1.5 mins.
>>>>
>>>> I ran conv_bench and rnn_bench from DeepBench, which took just about 2
>>>> minutes.
>>>>
>>>> Because of that, it's possible the size of the app has an effect on
>>>> setup time, as the HIP apps are extremely small.
>>>>
>>>> Also, the commit Dan is checked out on is d0945dc ("mem-ruby: add cache
>>>> hit/miss statistics for TCP and TCC",
>>>> https://gem5.googlesource.com/amd/gem5/+/d0945dc285cf146de160808d7e6d4c1fd3f73639),
>>>> which isn't the most recent commit. I don't believe that would account
>>>> for such a large slowdown, but it doesn't hurt to try the newest commit
>>>> unless it breaks something.
>>>>
>>>> Kyle
>>>> ------------------------------
>>>> *From:* mattdsincl...@gmail.com <mattdsincl...@gmail.com>
>>>> *Sent:* Thursday, June 11, 2020 1:15 AM
>>>> *To:* gem5 users mailing list <gem5-users@gem5.org>
>>>> *Cc:* Daniel Gerzhoy <daniel.gerz...@gmail.com>; GAURAV JAIN <gja...@wisc.edu>; Kyle Roarty <kroa...@wisc.edu>
>>>> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>>>>
>>>> Gaurav & Kyle, do you know if this is the case?
>>>>
>>>> Dan, I believe the short answer is yes, although 7-8 minutes seems a
>>>> little long. Are you running this in Kyle's Docker, or separately?
>>>> If in the Docker, that does increase the overhead somewhat, so running
>>>> it directly on a system would likely reduce the overhead. Also, are you
>>>> running with the release or debug version of the ROCm drivers? Again, the
>>>> debug version will likely add some time to this.
>>>>
>>>> Matt
>>>>
>>>> On Wed, Jun 10, 2020 at 2:00 PM Daniel Gerzhoy via gem5-users <gem5-users@gem5.org> wrote:
>>>>
>>>> I've been running simulations using the GCN3 branch:
>>>>
>>>> rocm_ver=1.6.2
>>>> $ git branch
>>>> * (HEAD detached at d0945dc)
>>>>   agutierr/master-gcn3-staging
>>>>
>>>> And I've noticed that it takes roughly 7-8 minutes to get to main().
>>>>
>>>> I'm guessing that this is the simulator setting up drivers?
>>>> Is that correct? Is there other stuff going on?
>>>>
>>>> *Has anyone found a way to speed this up?*
>>>>
>>>> I am trying to get some of the Rodinia benchmarks from the HIP-Examples
>>>> running, and debugging takes a long time as a result.
>>>>
>>>> I suspect that this is unavoidable, but I won't know if I don't ask!
>>>>
>>>> Cheers,
>>>>
>>>> Dan Gerzhoy
>>>> _______________________________________________
>>>> gem5-users mailing list -- gem5-users@gem5.org
>>>> To unsubscribe send an email to gem5-users-le...@gem5.org