Hi,

I am not sure whether chiming in now will add to the confusion, but I'll give 
it a try anyway.

@Daniel Gerzhoy - for hipDeviceSynchronize, as Matt mentioned, they are working 
on a fix and should have it out soon. In the meantime, if you want to, can you 
try this:

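    // Ask the runtime to spin-wait on synchronization; set once, before the first launch.
    hipSetDeviceFlags(hipDeviceScheduleSpin);
    for (int k = 1; k < dim; k++) {
        hipLaunchKernelGGL(HIP_KERNEL_NAME(somekernel), grid, threads, 0, 0);
        // Block the host until the kernel launched above has finished.
        hipDeviceSynchronize();
    }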

For me this seemed to do the trick in many cases, though not all; in the cases 
where it didn't work, I got the same unmapped-address error as you. You should 
check out the HEAD first and then try this. I don't expect it to make a 
difference, but it's still worth a shot.


________________________________
From: mattdsincl...@gmail.com <mattdsincl...@gmail.com>
Sent: Friday, June 12, 2020 2:14 PM
To: Daniel Gerzhoy <daniel.gerz...@gmail.com>
Cc: Kyle Roarty <kroa...@wisc.edu>; GAURAV JAIN <gja...@wisc.edu>; gem5 users 
mailing list <gem5-users@gem5.org>
Subject: Re: [gem5-users] GCN3 GPU Simulation Start-Up Time

Hi Dan,

Glad to hear things are working, and thanks for the tips!  I must admit to not 
quite following what the solution was though -- are you saying the solution is 
to replace exit(0)/return with m5_exit()?  I thought your original post said 
the problem was things taking a really long time before main?  If so, it would 
seem like something else must have been the problem/solution?

Coming to your other questions: I don't recall what exactly the root cause of 
the hipDeviceSynchronize failure is, but I would definitely recommend updating 
to the current staging branch head first and testing.  I am also hoping to push 
a fix today to the barrier bit synchronization -- most of the 
hipDeviceSynchronize-type failures I've seen were due to a bug in my barrier 
bit implementation.  I'm not sure if this will be the solution to your problem 
or not, but I can definitely add you as a reviewer and/or point you to it if 
needed.

Not sure about the m5op, hopefully someone else can chime in on that.

Thanks,
Matt

On Fri, Jun 12, 2020 at 12:12 PM Daniel Gerzhoy 
<daniel.gerz...@gmail.com> wrote:
I've figured it out.

To measure the time it took to get to main() I put a return 0; at the beginning 
of the function so I wouldn't have to babysit it.

I didn't consider that it would also take some time for the simulator to exit, 
which is where the extra few minutes come from.
Side note: calling m5_exit(0); instead of returning exits immediately.

About 5 minutes is a more reasonable slowdown given the difference between the 
two machines' clock speeds.

Two incidental things:

1. Is there a way to have gem5 spit out (real wall-clock) timestamps while it's 
printing stuff?
2. A while ago I asked about hipDeviceSynchronize(); causing crashes (panic: 
Tried to read unmapped address 0xff0000c29f48.). Has this been fixed since?

I'm going to update to the head of this branch soon, and eventually to the main 
branch. If it hasn't been fixed I've created a workaround by stealing the 
completion signal of the kernel based on its launch id, and manually waiting 
for it using the HSA interface.
Happy to help out and implement this as an m5op (or something) if that would be 
helpful for you guys.
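
On the first question above (wall-clock timestamps), one generic option that 
does not depend on gem5 at all is to pipe the simulator's output through a 
small filter that prefixes each line with the host time. The following is only 
a minimal sketch under that assumption: the script name stamp.py and the 
function name stamp are hypothetical, and nothing here is a gem5 feature.

```python
import sys
import time


def stamp(line, now=None):
    """Prefix one line of text with a host wall-clock timestamp.

    `now` is seconds since the epoch; None means "the current time".
    """
    prefix = time.strftime("[%Y-%m-%d %H:%M:%S] ", time.localtime(now))
    return prefix + line


def main():
    # Handle input line by line so each stamp reflects when the line arrived.
    for line in sys.stdin:
        sys.stdout.write(stamp(line))
        sys.stdout.flush()


if __name__ == "__main__":
    main()
```

It would be used by appending something like "2>&1 | python3 stamp.py" to the 
usual gem5 command line. The stamps show when the filter received each line, so 
any output buffering in the simulator itself can make them later than the 
moment the message was actually produced.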

Best,

Dan

On Thu, Jun 11, 2020 at 12:40 PM Matt Sinclair 
<mattdsincl...@gmail.com> wrote:
I don't see anything amazingly amiss in your output, but the number of times 
the open/etc. fail is interesting -- Kyle do we see the same thing?  If not, it 
could be that you should update your apu_se.py to point to the "correct" place 
to search for the libraries first?

Also, based on Kyle's reply, Dan how long does it take you to boot up square?  
Certainly a slower machine might take longer, but it does seem even slower than 
expected.  But if we're trying the same application, maybe it will be easier to 
spot differences.

I would also recommend updating to the latest commit on the staging branch -- I 
don't believe it should break anything with those patches.

Yes, looks like you are using the release version of ROCm -- no issues there.

Matt



On Thu, Jun 11, 2020 at 9:38 AM Daniel Gerzhoy 
<daniel.gerz...@gmail.com> wrote:
I am using the docker, yeah.
It's running on our server cluster, which is a Xeon Gold 5117 (2.0-2.8 GHz). 
That might account for some of the difference, since the r5 3600 has a faster 
clock (3.6-4.2 GHz).

I've hesitated to update my branch because the Dockerfile specifically checks 
out this branch and applies a patch, though the patch isn't very extensive.
This was from a while back (November maybe?) and I know you guys have been 
integrating things into the main branch (thanks!).
I was thinking I would wait until it's fully merged into the mainline gem5 
branch, then rebase onto that and try to merge my changes in.

Last I checked, the GCN3 stuff is in the dev branch, not master, right?

But if it will help maybe I should update to the head of this branch. Will I 
need to update the docker as well?

As for debug vs. release ROCm, I think I'm using the release version. This is 
what the Dockerfile built:

ARG rocm_ver=1.6.2
RUN wget -qO- repo.radeon.com/rocm/archive/apt_${rocm_ver}.tar.bz2 \
    | tar -xjv \
    && cd apt_${rocm_ver}/pool/main/ \
    && dpkg -i h/hsakmt-roct-dev/* \
    && dpkg -i h/hsa-ext-rocr-dev/* \
    && dpkg -i h/hsa-rocr-dev/* \
    && dpkg -i r/rocm-utils/* \
    && dpkg -i h/hcc/* \
    && dpkg -i h/hip_base/* \
    && dpkg -i h/hip_hcc/* \
    && dpkg -i h/hip_samples/*


I ran a benchmark that prints that it entered main and returns immediately; 
this took 9 minutes.
I've attached a debug trace with debug flags = "GPUDriver,SyscallVerbose".
There are a lot of weird things going on: "syscall open: failed", "syscall brk: 
break point changed to [...]", and lots of ignored system calls.
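
To put numbers on those messages, a small script along the following lines 
could count them in the trace. This is a sketch only: it assumes each message 
sits on a single line containing the substring quoted above, which is all this 
thread shows of the trace format, and the names tally, PATTERNS and tally.py 
are hypothetical.

```python
import sys
from collections import Counter

# Substrings quoted in the message above; the rest of each trace line's
# layout is unknown here, so only substring matching is used.
PATTERNS = ("syscall open: failed", "syscall brk: break point changed")


def tally(lines, patterns=PATTERNS):
    """Count how many lines contain each pattern."""
    counts = Counter({p: 0 for p in patterns})
    for line in lines:
        for p in patterns:
            if p in line:
                counts[p] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python3 tally.py trace.txt
    with open(sys.argv[1], errors="replace") as trace:
        for pattern, n in tally(trace).items():
            print(f"{n:8d}  {pattern}")
```

Comparing the counts from two machines running the same application is one 
way to see whether the failed opens differ between setups.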

head of Stats for reference:
---------- Begin Simulation Statistics ----------
sim_seconds                                  0.096192                       # Number of seconds simulated
sim_ticks                                 96192368500                       # Number of ticks simulated
final_tick                                96192368500                       # Number of ticks from beginning of simulation (restored from checkpoints and never reset)
sim_freq                                 1000000000000                       # Frequency of simulated ticks
host_inst_rate                                 175209                       # Simulator instruction rate (inst/s)
host_op_rate                                   338409                       # Simulator op (including micro ops) rate (op/s)
host_tick_rate                              175362515                       # Simulator tick rate (ticks/s)
host_mem_usage                                1628608                       # Number of bytes of host memory used
host_seconds                                   548.53                       # Real time elapsed on the host
sim_insts                                    96108256                       # Number of instructions simulated
sim_ops                                     185628785                       # Number of ops (including micro ops) simulated
system.voltage_domain.voltage                       1                       # Voltage in Volts
system.clk_domain.clock                          1000                       # Clock period in ticks
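
As a cross-check, the figures above are mutually consistent: host_seconds is 
about 9.1 minutes, which matches the 9 minutes observed, and sim_ticks divided 
by sim_freq reproduces sim_seconds. A minimal sketch of that arithmetic 
follows, with the values copied from the stats above; the names STATS and 
summarize are illustrative only.

```python
# Values copied from the stats excerpt above.
STATS = {
    "sim_ticks": 96192368500,
    "sim_freq": 1000000000000,
    "host_seconds": 548.53,
    "sim_insts": 96108256,
}


def summarize(stats):
    """Derive a few headline numbers from raw gem5 stats values."""
    sim_seconds = stats["sim_ticks"] / stats["sim_freq"]
    host_minutes = stats["host_seconds"] / 60
    # Instructions simulated per host second; compare with host_inst_rate.
    inst_rate = stats["sim_insts"] / stats["host_seconds"]
    # Host seconds spent per simulated second.
    slowdown = stats["host_seconds"] / sim_seconds
    return sim_seconds, host_minutes, inst_rate, slowdown


if __name__ == "__main__":
    sim_seconds, host_minutes, inst_rate, slowdown = summarize(STATS)
    print(f"simulated time : {sim_seconds:.6f} s")
    print(f"host time      : {host_minutes:.1f} min")
    print(f"inst rate      : {inst_rate:.0f} inst/s")
    print(f"slowdown       : {slowdown:.0f}x")
```

The recomputed instruction rate differs from the reported host_inst_rate only 
in the last digits, as expected when host_seconds is rounded to two decimals.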

Maybe something in the attached file explains it better than I can express.

Many thanks for your help and hard work!

Dan





On Thu, Jun 11, 2020 at 3:32 AM Kyle Roarty 
<kroa...@wisc.edu> wrote:
Running through a few applications, it took me about 2.5 minutes or less each 
time using docker to start executing the program on an r5 3600.

I ran square, dynamic_shared, and MatrixTranspose (All from HIP) which took 
about 1-1.5 mins.

I ran conv_bench and rnn_bench from DeepBench which took just about 2 minutes.

Because of that, it's possible the size of the app has an effect on setup time, 
as the HIP apps are extremely small.

Also, the commit Dan is checked out on is d0945dc, "mem-ruby: add cache 
hit/miss statistics for TCP and TCC" 
<https://gem5.googlesource.com/amd/gem5/+/d0945dc285cf146de160808d7e6d4c1fd3f73639>, 
which isn't the most recent commit. I don't believe that would account for such 
a large slowdown, but it doesn't hurt to try the newest commit unless it breaks 
something.

Kyle
________________________________
From: mattdsincl...@gmail.com <mattdsincl...@gmail.com>
Sent: Thursday, June 11, 2020 1:15 AM
To: gem5 users mailing list <gem5-users@gem5.org>
Cc: Daniel Gerzhoy <daniel.gerz...@gmail.com>; GAURAV JAIN <gja...@wisc.edu>; 
Kyle Roarty <kroa...@wisc.edu>
Subject: Re: [gem5-users] GCN3 GPU Simulation Start-Up Time

Gaurav & Kyle, do you know if this is the case?

Dan, I believe the short answer is yes, although 7-8 minutes seems a little 
long.  Are you running this in Kyle's Docker, or separately?  If in the Docker, 
that does add some overhead, so running directly on the host system would 
likely be somewhat faster.  Also, are you running with the release or debug 
version of the ROCm drivers?  Again, the debug version will likely add some 
time to this.

Matt

On Wed, Jun 10, 2020 at 2:00 PM Daniel Gerzhoy via gem5-users 
<gem5-users@gem5.org> wrote:
I've been running simulations using the GCN3 branch:

rocm_ver=1.6.2
$git branch
   * (HEAD detached at d0945dc)
      agutierr/master-gcn3-staging

And I've noticed that it takes roughly 7-8 minutes to get to main()

I'm guessing that this is the simulator setting up drivers?
Is that correct? Is there other stuff going on?

Has anyone found a way to speed this up?

I am trying to get some of the rodinia benchmarks from the HIP-Examples running 
and debugging takes a long time as a result.

I suspect that this is unavoidable but I won't know if I don't ask!

Cheers,

Dan Gerzhoy
_______________________________________________
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org