Hey Matt,

Happy to do that if you think it's viable, but I have to say my workaround is pretty hacky. There are definitely some benchmark changes on top of the changes to the simulator.
Let me describe it for you, and then if you still think it's a good idea I'll make a patch.

My workaround relies on two facts:

1. Launching a kernel sets up a completion signal that hipDeviceSynchronize() ultimately waits on.
2. That completion signal is all the benchmark needs in order to know that the kernel is complete.

So I implement HSA_PACKET_TYPE_AGENT_DISPATCH in hsa_packet_processor.cc and gpu_command_processor.cc to receive commands from the benchmark directly. One of the commands steals the completion signal for a particular kernel and passes it back to the benchmark. After you launch the kernel (normally), you pass in that kernel's id (which you have to keep track of) and send the command to steal the completion signal; it gets passed back in the return_address member of the agent packet. In the benchmark I store that signal and wait on it with hsa_signal_wait_relaxed. You have to do this every time you launch a kernel, though you could conceivably overload hipDeviceSynchronize() to do this for you and keep track of kernel launches too.

Let me know if you think this is still something you guys want.

Cheers,
Dan

On Mon, Jun 22, 2020 at 2:04 PM Matt Sinclair <mattdsincl...@gmail.com> wrote: > Hi Dan, > > Do you mind pushing your workaround with the completion signal as a patch > to the staging branch so we can take a look? Or is this just a change to > the program(s) itself? > > After Kyle's fix (which has been pushed as an update to my patch), we're > still seeing some hipDeviceSynchronize failures. So we're interested in > looking at what you did to see if it solves the problem. > > Matt > > On Fri, Jun 19, 2020 at 4:15 PM Kyle Roarty <kroa...@wisc.edu> wrote: > >> Hi Dan, >> >> Another thing to try is to add and set the environment variable HIP_DB in >> apu_se.py (Line 461 for me, starts with "env = ['LD_LIBRARY_PATH...").
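[For readers of the archive: the steal-and-wait sequence Dan describes might look roughly like this on the benchmark side. This is only a sketch, not the actual patch -- the STEAL_COMPLETION_SIGNAL command code, the use of arg[0] for the kernel id, and the assumption that the simulator writes the signal back through return_address are all guesses at the modified packet processor's interface.]

```c
/* Sketch of the workaround: send a hypothetical "steal completion signal"
 * agent dispatch command to the (modified) simulated packet processor,
 * then wait on the returned signal with hsa_signal_wait_relaxed. */
#include <hsa/hsa.h>
#include <string.h>

#define STEAL_COMPLETION_SIGNAL 0x100  /* hypothetical command code */

hsa_signal_t steal_completion_signal(hsa_queue_t *queue, uint64_t kernel_id)
{
    hsa_signal_t result = {0};

    /* Reserve a slot in the queue and build an agent dispatch packet. */
    uint64_t index = hsa_queue_add_write_index_relaxed(queue, 1);
    hsa_agent_dispatch_packet_t *pkt =
        (hsa_agent_dispatch_packet_t *)queue->base_address
        + (index & (queue->size - 1));

    memset(pkt, 0, sizeof(*pkt));
    pkt->type = STEAL_COMPLETION_SIGNAL;
    pkt->arg[0] = kernel_id;        /* which kernel's signal to steal */
    pkt->return_address = &result;  /* assumed: simulator writes signal here */

    hsa_signal_t done;
    hsa_signal_create(1, 0, NULL, &done);
    pkt->completion_signal = done;

    /* Publish the packet: write the header last, then ring the doorbell. */
    __atomic_store_n(&pkt->header,
                     (uint16_t)(HSA_PACKET_TYPE_AGENT_DISPATCH
                                << HSA_PACKET_HEADER_TYPE),
                     __ATOMIC_RELEASE);
    hsa_signal_store_relaxed(queue->doorbell_signal, index);

    /* Wait for the steal command itself to complete, then clean up. */
    hsa_signal_wait_relaxed(done, HSA_SIGNAL_CONDITION_LT, 1,
                            UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
    hsa_signal_destroy(done);
    return result;
}

/* After launching a kernel normally, wait on its stolen signal. */
void wait_for_kernel(hsa_queue_t *queue, uint64_t kernel_id)
{
    hsa_signal_t sig = steal_completion_signal(queue, kernel_id);
    hsa_signal_wait_relaxed(sig, HSA_SIGNAL_CONDITION_LT, 1,
                            UINT64_MAX, HSA_WAIT_STATE_BLOCKED);
}
```

With something like this in place, each kernel launch would be followed by wait_for_kernel(...) instead of hipDeviceSynchronize().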
>> Setting HIP_DB=sync or HIP_DB=api has prevented crashing on >> hipDeviceSynchronize() calls for the applications I've tested. >> >> I had traced through this issue (or at least one that manifests the same >> way) a while back, and what I remember is that the crash happens somewhere >> in the HIP code, and it occurs because somewhere much earlier we go down a >> codepath that doesn't clear a register (I believe that was also in HIP >> code). That register then gets re-used until the error propagates to the >> register used in the ld instruction. Unfortunately, I had a hard time >> getting consistent, manageable traces, so I wasn't able to figure out why >> we were going down the wrong codepath. >> >> Kyle >> ------------------------------ >> *From:* mattdsincl...@gmail.com <mattdsincl...@gmail.com> >> *Sent:* Friday, June 19, 2020 2:08 PM >> *To:* Daniel Gerzhoy <daniel.gerz...@gmail.com> >> *Cc:* GAURAV JAIN <gja...@wisc.edu>; Kyle Roarty <kroa...@wisc.edu>; >> gem5 users mailing list <gem5-users@gem5.org> >> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time >> >> Thanks Dan. Kyle R. has found some things about the patch that we're >> testing, and fixes may need to be pushed pending those results. Fingers crossed >> that fix will help you too. >> >> As Gaurav mentioned previously, the spin flag did not always solve the >> problem for us -- seems like that is true for you too, although I don't >> remember square ever failing for us. >> >> I don't know exactly where that PC is coming from, I'd have to get a >> trace. But I suspect it's actually a GPU address being accessed by some >> instruction that's failing -- in the past when I've seen this kind of >> issue, it was happening because the kernel boundary was not being respected >> and code was running that shouldn't have been running yet. I don't know >> what your use case is, so it's possible that is not the issue for you -- a >> trace would be the only way to know for sure. 
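[For readers of the archive: Kyle's HIP_DB suggestion above would look something like this in gem5's apu_se.py. The surrounding env list and the library paths below are illustrative stand-ins, not the real config code.]

```python
# Illustrative sketch for configs/example/apu_se.py: the guest process
# environment is built as a list of KEY=VALUE strings (around line 461).
# Appending HIP_DB enables HIP's debug tracing ('sync' traces
# synchronization, 'api' traces every HIP API call), which avoided the
# hipDeviceSynchronize() crash for the applications Kyle tested.
rocm_paths = ['/opt/rocm/lib', '/opt/rocm/hip/lib']  # hypothetical paths

env = ['LD_LIBRARY_PATH=%s' % ':'.join(rocm_paths),
       'HOME=/root']
env.append('HIP_DB=sync')  # or 'HIP_DB=api'
```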
>> >> Matt >> >> Regards, >> Matt Sinclair >> Assistant Professor >> University of Wisconsin-Madison >> Computer Sciences Department >> cs.wisc.edu/~sinclair >> >> On Wed, Jun 17, 2020 at 10:30 AM Daniel Gerzhoy <daniel.gerz...@gmail.com> >> wrote: >> >> Hey Matt, >> >> Thanks for pushing those changes. I updated the head of the amd staging >> branch and tried to run square. The time to get into main stays about the >> same (5min) FYI. >> >> But the hipDeviceSynchronize() fails even when I add >> hipSetDeviceFlags(hipDeviceScheduleSpin); >> unfortunately. >> >> panic: Tried to read unmapped address 0x1853e78. >> PC: 0x7ffff52a966b, Instr: MOV_R_M : ld rdi, DS:[rbx + 0x8] >> >> Is that PC (0x7ffff52a966b) somewhere in the hip code or something? >> ....oooor in the emulated driver? The line between the simulator code and >> guest code is kind of blurry to me around there haha. >> >> Best, >> >> Dan >> >> On Mon, Jun 15, 2020 at 9:59 PM Matt Sinclair <mattdsincl...@gmail.com> >> wrote: >> >> Hi Dan, >> >> Thanks for the update. Apologies for the delay, the patch didn't apply >> cleanly initially, but I have pushed the patch I promised previously. >> Since I'm not sure if you're on the develop branch or the AMD staging >> branch, I pushed it to both (there are some differences in code on the >> branches, which I hope will be resolved over time as more of the commits >> from the staging branch are pushed to develop): >> >> - develop: https://gem5-review.googlesource.com/c/public/gem5/+/30354 >> - AMD staging: https://gem5-review.googlesource.com/c/amd/gem5/+/30335 >> >> I have validated that both of them compile, and asked Kyle R to test that >> both of them a) don't break anything that is expected to work publicly with >> the GPU and b) hopefully resolve some of the problems (like yours) with >> barrier synchronization. Let us know if this solves your problem too -- >> fingers crossed. 
>> >> Thanks, >> Matt >> >> On Fri, Jun 12, 2020 at 2:47 PM Daniel Gerzhoy <daniel.gerz...@gmail.com> >> wrote: >> >> Matt, >> >> It wasn't so much a solution as an explanation. Kyle was running on an r5 >> 3600 (3.6-4.2 GHz) whereas I am on a Xeon Gold 5117 @ (2.0 - 2.8 GHz) >> >> The relative difference in clock speed seems to me to be a more >> reasonable explanation for a slowdown from 1-1.5 minutes to ~5min (actual >> time before main) than the 8 min (time before main + exit time) I was seeing >> before. >> >> I'll update to the latest branch and see if that speeds me up further. >> I'm also going to try running on a faster machine as well though that will >> take some setup-time. >> >> Gaurav, >> >> Thanks for the tip, that will be helpful in the meantime. >> >> Dan >> >> On Fri, Jun 12, 2020 at 3:41 PM GAURAV JAIN <gja...@wisc.edu> wrote: >> >> Hi, >> >> I am not sure if chiming in now would cause any more confusion, but still >> giving it a try. >> >> @Daniel Gerzhoy <daniel.gerz...@gmail.com> - for hipDeviceSynchronize, >> as Matt mentioned, they are working on a fix and should have it out there. >> If you want to, can you try this: >> >> hipSetDeviceFlags(hipDeviceScheduleSpin); >> for (int k = 1; k < dim; k++) { >> hipLaunchKernelGGL(HIP_KERNEL_NAME(somekernel), grid, threads, 0, >> 0); >> hipDeviceSynchronize(); >> } >> >> For me, in many cases (not all, and in the ones where it didn't work, I >> got the same unmapped error as you), this seemed like doing the >> trick. You should checkout the HEAD and then try this. I am not sure it >> will make any difference, but it's still worth a shot. 
>> >> >> ------------------------------ >> *From:* mattdsincl...@gmail.com <mattdsincl...@gmail.com> >> *Sent:* Friday, June 12, 2020 2:14 PM >> *To:* Daniel Gerzhoy <daniel.gerz...@gmail.com> >> *Cc:* Kyle Roarty <kroa...@wisc.edu>; GAURAV JAIN <gja...@wisc.edu>; >> gem5 users mailing list <gem5-users@gem5.org> >> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time >> >> Hi Dan, >> >> Glad to hear things are working, and thanks for the tips! I must admit >> to not quite following what the solution was though -- are you saying the >> solution is to replace exit(0)/return with m5_exit()? I thought your >> original post said the problem was things taking a really long time before >> main? If so, it would seem like something else must have been the >> problem/solution? >> >> Coming to your other questions: I don't recall what exactly the root >> cause of the hipDeviceSynchronize failure is, but I would definitely >> recommend updating to the current staging branch head first and testing. I >> am also hoping to push a fix today to the barrier bit synchronization -- >> most of the hipDeviceSynchronize-type failures I've seen were due to a bug >> in my barrier bit implementation. I'm not sure if this will be the >> solution to your problem or not, but I can definitely add you as a reviewer >> and/or point you to it if needed. >> >> Not sure about the m5op, hopefully someone else can chime in on that. >> >> Thanks, >> Matt >> >> On Fri, Jun 12, 2020 at 12:12 PM Daniel Gerzhoy <daniel.gerz...@gmail.com> >> wrote: >> >> I've figured it out. >> >> To measure the time it took to get to main() I put a *return 0; *at the >> beginning of the function so I wouldn't have to babysit it. >> >> I didn't consider that it would also take some time for the simulator to >> exit, which is where the extra few minutes comes from. >> Side-note: *m5_exit(0);* instead of a return exits immediately. 
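[For readers of the archive: the m5_exit side-note above might look like this in a workload. This is a sketch; it assumes gem5's m5ops library from util/m5 is built and linked, and that the header path matches your gem5 checkout.]

```c
/* Sketch: end the simulation immediately from inside the workload instead
 * of returning from main(). Assumes libm5 from gem5's util/m5 is linked in
 * and the m5ops header is on the include path. */
#include <gem5/m5ops.h>

int main(void)
{
    /* ... benchmark work being measured ... */

    m5_exit(0);  /* stop gem5 at this tick; no simulated teardown runs */
    return 0;    /* never reached when running under gem5 */
}
```

Because simulation stops at the m5_exit call, any teardown after it (destructors, atexit handlers) never runs in the simulated program, which is why it exits so much faster than a plain return.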
>> >> 5 min is a bit more reasonable of a slowdown for the difference between >> the two clocks. >> >> Two incidental things: >> >> 1. Is there a way to have gem5 spit out (real wall-clock) timestamps >> while it's printing stuff? >> 2. A while ago I asked about hipDeviceSynchronize(); causing crashes >> (panic: Tried to read unmapped address 0xff0000c29f48.). Has this been >> fixed since? >> >> I'm going to update to the head of this branch soon, and eventually to >> the main branch. If it hasn't been fixed I've created a workaround by >> stealing the completion signal of the kernel based on its launch id, and >> manually waiting for it using the HSA interface. >> Happy to help out and implement this as a m5op (or something) if that >> would be helpful for you guys. >> >> Best, >> >> Dan >> >> On Thu, Jun 11, 2020 at 12:40 PM Matt Sinclair <mattdsincl...@gmail.com> >> wrote: >> >> I don't see anything amazingly amiss in your output, but the number of >> times the open/etc. fail is interesting -- Kyle do we see the same thing? >> If not, it could be that you should update your apu_se.py to point to the >> "correct" place to search for the libraries first? >> >> Also, based on Kyle's reply, Dan how long does it take you to boot up >> square? Certainly a slower machine might take longer, but it does seem >> even slower than expected. But if we're trying the same application, maybe >> it will be easier to spot differences. >> >> I would also recommend updating to the latest commit on the staging >> branch -- I don't believe it should break anything with those patches. >> >> Yes, looks like you are using the release version of ROCm -- no issues >> there. >> >> Matt >> >> >> >> On Thu, Jun 11, 2020 at 9:38 AM Daniel Gerzhoy <daniel.gerz...@gmail.com> >> wrote: >> >> I am using the docker, yeah. >> It's running on our server cluster which is a Xeon Gold 5117 @ (2.0 - 2.8 >> GHz) which might make up some of the difference, the r5 3600 has a faster >> clock (3.6-4.2 GHz). 
>> >> I've hesitated to update my branch because in the Dockerfile it >> specifically checks this branch out and applies a patch, though the patch >> isn't very extensive. >> This was from a while back (November maybe?) and I know you guys have >> been integrating things into the main branch (thanks!) >> I was thinking I would wait until it's fully merged into the mainline >> gem5 branch and rebase onto that and try to merge my changes in. >> >> Last I checked the GCN3 stuff is in the dev branch not the master right? >> >> But if it will help maybe I should update to the head of this branch. >> Will I need to update the docker as well? >> >> As for the debug vs release rocm I think I'm using the release version. >> This is what the dockerfile built: >> >> ARG rocm_ver=1.6.2 >> RUN wget -qO- repo.radeon.com/rocm/archive/apt_${rocm_ver}.tar.bz2 >> <http://repo.radeon.com/rocm/archive/apt_$%7Brocm_ver%7D.tar.bz2> \ >> | tar -xjv \ >> && cd apt_${rocm_ver}/pool/main/ \ >> && dpkg -i h/hsakmt-roct-dev/* \ >> && dpkg -i h/hsa-ext-rocr-dev/* \ >> && dpkg -i h/hsa-rocr-dev/* \ >> && dpkg -i r/rocm-utils/* \ >> && dpkg -i h/hcc/* \ >> && dpkg -i h/hip_base/* \ >> && dpkg -i h/hip_hcc/* \ >> && dpkg -i h/hip_samples/* >> >> >> I ran a benchmark that prints that it entered main and returns >> immediately, this took 9 minutes. >> I've attached a debug trace with debug flags = "GPUDriver,SyscallVerbose" >> There's a lot of weird things going on, "syscall open: failed", "syscall >> brk: break point changed to [...]", and lots of ignored system calls. 
>> >> head of Stats for reference: >> ---------- Begin Simulation Statistics ---------- >> sim_seconds 0.096192 >> # Number of seconds simulated >> sim_ticks 96192368500 >> # Number of ticks simulated >> final_tick 96192368500 >> # Number of ticks from beginning of simulation (restored from checkpoints >> and never reset) >> sim_freq 1000000000000 >> # Frequency of simulated ticks >> host_inst_rate 175209 >> # Simulator instruction rate (inst/s) >> host_op_rate 338409 >> # Simulator op (including micro ops) rate (op/s) >> host_tick_rate 175362515 >> # Simulator tick rate (ticks/s) >> host_mem_usage 1628608 >> # Number of bytes of host memory used >> host_seconds 548.53 >> # Real time elapsed on the host >> sim_insts 96108256 >> # Number of instructions simulated >> sim_ops 185628785 >> # Number of ops (including micro ops) simulated >> system.voltage_domain.voltage 1 >> # Voltage in Volts >> system.clk_domain.clock 1000 >> # Clock period in ticks >> >> Maybe something in the attached file explains it better than I can >> express. >> >> Many thanks for your help and hard work! >> >> Dan >> >> >> >> >> >> On Thu, Jun 11, 2020 at 3:32 AM Kyle Roarty <kroa...@wisc.edu> wrote: >> >> Running through a few applications, it took me about 2.5 minutes or less >> each time using docker to start executing the program on an r5 3600. >> >> I ran square, dynamic_shared, and MatrixTranspose (All from HIP) which >> took about 1-1.5 mins. >> >> I ran conv_bench and rnn_bench from DeepBench which took just about 2 >> minutes. >> >> Because of that, it's possible the size of the app has an effect on setup >> time, as the HIP apps are extremely small. >> >> Also, the commit Dan is checked out on is d0945dc >> <https://gem5.googlesource.com/amd/gem5/+/d0945dc285cf146de160808d7e6d4c1fd3f73639> >> mem-ruby: >> add cache hit/miss statistics for TCP and TCC >> <https://gem5.googlesource.com/amd/gem5/+/d0945dc285cf146de160808d7e6d4c1fd3f73639>, >> which isn't the most recent commit. 
I don't believe that that would account >> for such a large slowdown, but it doesn't hurt to try the newest commit >> unless it breaks something. >> >> Kyle >> ------------------------------ >> *From:* mattdsincl...@gmail.com <mattdsincl...@gmail.com> >> *Sent:* Thursday, June 11, 2020 1:15 AM >> *To:* gem5 users mailing list <gem5-users@gem5.org> >> *Cc:* Daniel Gerzhoy <daniel.gerz...@gmail.com>; GAURAV JAIN < >> gja...@wisc.edu>; Kyle Roarty <kroa...@wisc.edu> >> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time >> >> Gaurav & Kyle, do you know if this is the case? >> >> Dan, I believe the short answer is yes although 7-8 minutes seems a >> little long. Are you running this in Kyle's Docker, or separately? If in >> the Docker, that does increase the overhead somewhat, so running it >> directly on a system would likely reduce the overhead somewhat. Also, are >> you running with the release or debug version of the ROCm drivers? Again, >> debug version will likely add some time to this. >> >> Matt >> >> On Wed, Jun 10, 2020 at 2:00 PM Daniel Gerzhoy via gem5-users < >> gem5-users@gem5.org> wrote: >> >> I've been running simulations using the GCN3 branch: >> >> rocm_ver=1.6.2 >> $git branch >> * (HEAD detached at d0945dc) >> agutierr/master-gcn3-staging >> >> And I've noticed that it takes roughly 7-8 minutes to get to main() >> >> I'm guessing that this is the simulator setting up drivers? >> Is that correct? Is there other stuff going on? >> >> *Has anyone found a way to speed this up? * >> >> I am trying to get some of the rodinia benchmarks from the HIP-Examples >> running and debugging takes a long time as a result. >> >> I suspect that this is unavoidable but I won't know if I don't ask! >> >> Cheers, >> >> Dan Gerzhoy >> _______________________________________________ >> gem5-users mailing list -- gem5-users@gem5.org >> To unsubscribe send an email to gem5-users-le...@gem5.org >> %(web_page_url)slistinfo%(cgiext)s/%(_internal_name)s >> >>