[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-22 Thread Daniel Gerzhoy via gem5-users
Sounds good.

I'll generate a patch for the staging branch in the next few days.

Dan

On Mon, Jun 22, 2020 at 2:50 PM Matt Sinclair 
wrote:

> In my opinion, adding support for HSA_PACKET_TYPE_AGENT_DISPATCH,
> irrespective of the current issues, is worthwhile and helpful to push.
>
> If you have a minimum working example of how you change the benchmark,
> that would be helpful too.
>
> Kyle R. has spent a bunch of time trying to identify the source of the
> problem within the synchronize call, but thus far we haven't found anything
> concrete.  So for now, having this workaround would definitely be helpful
> for the community.
>
> Matt
>
> On Mon, Jun 22, 2020 at 1:25 PM Daniel Gerzhoy 
> wrote:
>
>> Hey Matt,
>>
>> Happy to do that if you think it's viable, but I have to say my
>> workaround is pretty hack-y. There are definitely some benchmark changes on
>> top of changes to the simulator.
>>
>> Let me describe it for you and then if you still think it's a good idea
>> I'll make a patch.
>>
>> My workaround relies on the fact that:
>> 1. Launching a kernel sets up a completion signal that
>> hipDeviceSynchronize() ultimately waits on.
>> 2. All you need in the benchmark is that completion signal to know that
>> your kernel is complete.
>>
>> So I basically implement the HSA_PACKET_TYPE_AGENT_DISPATCH in
>> hsa_packet_processor.cc and gpu_command_processor.cc to receive commands
>> from the benchmark directly.
>> One of the commands is to steal the completion signal for a particular
>> kernel and pass it back to the benchmark.
>>
>> After you launch the kernel (normally) you pass in that kernel's id (you
>> have to keep track), then send a command to steal the completion signal. It
>> gets passed back in the return_address member of the agent packet.
>>
>> In the benchmark I store that signal and use it to do a
>> hsa_signal_wait_relaxed on it.
>> You have to do this every time you launch a kernel; you could
>> conceivably overload the hipDeviceSynchronize() function to do this for
>> you and keep track of kernel launches too.
>>
>> Let me know if you think this is still something you guys want.
>>
>> Cheers,
>>
>> Dan
>>
>>
>> On Mon, Jun 22, 2020 at 2:04 PM Matt Sinclair 
>> wrote:
>>
>>> Hi Dan,
>>>
>>> Do you mind pushing your workaround with the completion signal as a
>>> patch to the staging branch so we can take a look?  Or is this just a
>>> change to the program(s) itself?
>>>
>>> After Kyle's fix (which has been pushed as an update to my patch), we're
>>> still seeing some hipDeviceSynchronize failures.  So we're interested in
>>> looking at what you did to see if it solves the problem.
>>>
>>> Matt
>>>
>>> On Fri, Jun 19, 2020 at 4:15 PM Kyle Roarty  wrote:
>>>
 Hi Dan,

 Another thing to try is to add and set the environment variable HIP_DB
 in apu_se.py (Line 461 for me, starts with "env = ['LD_LIBRARY_PATH...").
 Setting HIP_DB=sync or HIP_DB=api has prevented crashing on
 hipDeviceSynchronize() calls for the applications I've tested.

 I had traced through this issue (or at least one that manifests the
 same way) a while back, and what I remember is that the crash happens
 somewhere in the HIP code, and it occurs because somewhere much earlier we
 go down a codepath that doesn't clear a register (I believe that was also
 in HIP code). That register then gets re-used until the error propagates to
 the register used in the ld instruction. Unfortunately, I had a hard time
 getting consistent, manageable traces, so I wasn't able to figure out
 why we were going down the wrong codepath.

 Kyle
 --
 *From:* mattdsincl...@gmail.com 
 *Sent:* Friday, June 19, 2020 2:08 PM
 *To:* Daniel Gerzhoy 
 *Cc:* GAURAV JAIN ; Kyle Roarty ;
 gem5 users mailing list 
 *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time

 Thanks Dan.  Kyle R. has found some things about the patch that we're
 testing, and fixes may need to be pushed pending those results.  Fingers crossed
 that fix will help you too.

 As Gaurav mentioned previously, the spin flag did not always solve the
 problem for us -- seems like that is true for you too, although I don't
 remember square ever failing for us.

 I don't know exactly where that PC is coming from, I'd have to get a
 trace.  But I suspect it's actually a GPU address being accessed by some
 instruction that's failing -- in the past when I've seen this kind of
 issue, it was happening because the kernel boundary was not being respected
 and code was running that shouldn't have been running yet.  I don't know
 what your use case is, so it's possible that is not the issue for you -- a
 trace would be the only way to know for sure.

 Matt

 Regards,
 Matt Sinclair
 Assistant Professor
 University of Wisconsin-Madison
 Computer Sciences Department
 cs.wisc.edu/~sinclair

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-22 Thread Matt Sinclair via gem5-users
In my opinion, adding support for HSA_PACKET_TYPE_AGENT_DISPATCH,
irrespective of the current issues, is worthwhile and helpful to push.

If you have a minimum working example of how you change the benchmark, that
would be helpful too.

Kyle R. has spent a bunch of time trying to identify the source of the
problem within the synchronize call, but thus far we haven't found anything
concrete.  So for now, having this workaround would definitely be helpful
for the community.

Matt

On Mon, Jun 22, 2020 at 1:25 PM Daniel Gerzhoy 
wrote:

> Hey Matt,
>
> Happy to do that if you think it's viable, but I have to say my workaround
> is pretty hack-y. There are definitely some benchmark changes on top of
> changes to the simulator.
>
> Let me describe it for you and then if you still think it's a good idea
> I'll make a patch.
>
> My workaround relies on the fact that:
> 1. Launching a kernel sets up a completion signal that
> hipDeviceSynchronize() ultimately waits on.
> 2. All you need in the benchmark is that completion signal to know that
> your kernel is complete.
>
> So I basically implement the HSA_PACKET_TYPE_AGENT_DISPATCH in
> hsa_packet_processor.cc and gpu_command_processor.cc to receive commands
> from the benchmark directly.
> One of the commands is to steal the completion signal for a particular
> kernel and pass it back to the benchmark.
>
> After you launch the kernel (normally) you pass in that kernel's id (you
> have to keep track), then send a command to steal the completion signal. It
> gets passed back in the return_address member of the agent packet.
>
> In the benchmark I store that signal and use it to do a
> hsa_signal_wait_relaxed on it.
> You have to do this every time you launch a kernel; you could
> conceivably overload the hipDeviceSynchronize() function to do this for
> you and keep track of kernel launches too.
>
> Let me know if you think this is still something you guys want.
>
> Cheers,
>
> Dan
>
>
> On Mon, Jun 22, 2020 at 2:04 PM Matt Sinclair 
> wrote:
>
>> Hi Dan,
>>
>> Do you mind pushing your workaround with the completion signal as a patch
>> to the staging branch so we can take a look?  Or is this just a change to
>> the program(s) itself?
>>
>> After Kyle's fix (which has been pushed as an update to my patch), we're
>> still seeing some hipDeviceSynchronize failures.  So we're interested in
>> looking at what you did to see if it solves the problem.
>>
>> Matt
>>
>> On Fri, Jun 19, 2020 at 4:15 PM Kyle Roarty  wrote:
>>
>>> Hi Dan,
>>>
>>> Another thing to try is to add and set the environment variable HIP_DB
>>> in apu_se.py (Line 461 for me, starts with "env = ['LD_LIBRARY_PATH...").
>>> Setting HIP_DB=sync or HIP_DB=api has prevented crashing on
>>> hipDeviceSynchronize() calls for the applications I've tested.
>>>
>>> I had traced through this issue (or at least one that manifests the same
>>> way) a while back, and what I remember is that the crash happens somewhere
>>> in the HIP code, and it occurs because somewhere much earlier we go down a
>>> codepath that doesn't clear a register (I believe that was also in HIP
>>> code). That register then gets re-used until the error propagates to the
>>> register used in the ld instruction. Unfortunately, I had a hard time
>>> getting consistent, manageable traces, so I wasn't able to figure out why
>>> we were going down the wrong codepath.
>>>
>>> Kyle
>>> --
>>> *From:* mattdsincl...@gmail.com 
>>> *Sent:* Friday, June 19, 2020 2:08 PM
>>> *To:* Daniel Gerzhoy 
>>> *Cc:* GAURAV JAIN ; Kyle Roarty ;
>>> gem5 users mailing list 
>>> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>>>
>>> Thanks Dan.  Kyle R. has found some things about the patch that we're
>>> testing, and fixes may need to be pushed pending those results.  Fingers crossed
>>> that fix will help you too.
>>>
>>> As Gaurav mentioned previously, the spin flag did not always solve the
>>> problem for us -- seems like that is true for you too, although I don't
>>> remember square ever failing for us.
>>>
>>> I don't know exactly where that PC is coming from, I'd have to get a
>>> trace.  But I suspect it's actually a GPU address being accessed by some
>>> instruction that's failing -- in the past when I've seen this kind of
>>> issue, it was happening because the kernel boundary was not being respected
>>> and code was running that shouldn't have been running yet.  I don't know
>>> what your use case is, so it's possible that is not the issue for you -- a
>>> trace would be the only way to know for sure.
>>>
>>> Matt
>>>
>>> Regards,
>>> Matt Sinclair
>>> Assistant Professor
>>> University of Wisconsin-Madison
>>> Computer Sciences Department
>>> cs.wisc.edu/~sinclair
>>>
>>> On Wed, Jun 17, 2020 at 10:30 AM Daniel Gerzhoy <
>>> daniel.gerz...@gmail.com> wrote:
>>>
>>> Hey Matt,
>>>
>>> Thanks for pushing those changes. I updated the head of the amd staging
>>> branch and tried to run square. The time to get into main stays about the
>>> same (5min) FYI.

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-22 Thread Daniel Gerzhoy via gem5-users
Hey Matt,

Happy to do that if you think it's viable, but I have to say my workaround
is pretty hack-y. There are definitely some benchmark changes on top of
changes to the simulator.

Let me describe it for you and then if you still think it's a good idea
I'll make a patch.

My workaround relies on the fact that:
1. Launching a kernel sets up a completion signal that
hipDeviceSynchronize() ultimately waits on.
2. All you need in the benchmark is that completion signal to know that
your kernel is complete.

So I basically implement the HSA_PACKET_TYPE_AGENT_DISPATCH in
hsa_packet_processor.cc and gpu_command_processor.cc to receive commands
from the benchmark directly.
One of the commands is to steal the completion signal for a particular
kernel and pass it back to the benchmark.

After you launch the kernel (normally) you pass in that kernel's id (you
have to keep track), then send a command to steal the completion signal. It
gets passed back in the return_address member of the agent packet.

In the benchmark I store that signal and use it to do a
hsa_signal_wait_relaxed on it.
You have to do this every time you launch a kernel; you could
conceivably overload the hipDeviceSynchronize() function to do this for
you and keep track of kernel launches too.
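
Roughly, the benchmark-side wait looks like the sketch below (minimal and
illustrative only; steal_completion_signal() is a stand-in for my custom
agent-dispatch command, not a stock HSA or HIP API -- only
hsa_signal_wait_relaxed() and its enums are standard HSA):

#include <hsa/hsa.h>
#include <cstdint>

// Hypothetical helper: sends the custom AGENT_DISPATCH command to the
// simulated packet processor, which hands the kernel's completion signal
// back through the packet's return_address field.
extern hsa_signal_t steal_completion_signal(uint64_t kernel_launch_id);

void wait_for_kernel(uint64_t kernel_launch_id)
{
    hsa_signal_t sig = steal_completion_signal(kernel_launch_id);

    // HSA completion signals start at 1 and drop to 0 when the kernel
    // finishes, so wait until the value falls below 1.
    hsa_signal_wait_relaxed(sig, HSA_SIGNAL_CONDITION_LT, 1,
                            UINT64_MAX, HSA_WAIT_STATE_ACTIVE);
}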

Let me know if you think this is still something you guys want.

Cheers,

Dan


On Mon, Jun 22, 2020 at 2:04 PM Matt Sinclair 
wrote:

> Hi Dan,
>
> Do you mind pushing your workaround with the completion signal as a patch
> to the staging branch so we can take a look?  Or is this just a change to
> the program(s) itself?
>
> After Kyle's fix (which has been pushed as an update to my patch), we're
> still seeing some hipDeviceSynchronize failures.  So we're interested in
> looking at what you did to see if it solves the problem.
>
> Matt
>
> On Fri, Jun 19, 2020 at 4:15 PM Kyle Roarty  wrote:
>
>> Hi Dan,
>>
>> Another thing to try is to add and set the environment variable HIP_DB in
>> apu_se.py (Line 461 for me, starts with "env = ['LD_LIBRARY_PATH...").
>> Setting HIP_DB=sync or HIP_DB=api has prevented crashing on
>> hipDeviceSynchronize() calls for the applications I've tested.
>>
>> I had traced through this issue (or at least one that manifests the same
>> way) a while back, and what I remember is that the crash happens somewhere
>> in the HIP code, and it occurs because somewhere much earlier we go down a
>> codepath that doesn't clear a register (I believe that was also in HIP
>> code). That register then gets re-used until the error propagates to the
>> register used in the ld instruction. Unfortunately, I had a hard time
>> getting consistent, manageable traces, so I wasn't able to figure out why
>> we were going down the wrong codepath.
>>
>> Kyle
>> --
>> *From:* mattdsincl...@gmail.com 
>> *Sent:* Friday, June 19, 2020 2:08 PM
>> *To:* Daniel Gerzhoy 
>> *Cc:* GAURAV JAIN ; Kyle Roarty ;
>> gem5 users mailing list 
>> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>>
>> Thanks Dan.  Kyle R. has found some things about the patch that we're
>> testing, and fixes may need to be pushed pending those results.  Fingers crossed
>> that fix will help you too.
>>
>> As Gaurav mentioned previously, the spin flag did not always solve the
>> problem for us -- seems like that is true for you too, although I don't
>> remember square ever failing for us.
>>
>> I don't know exactly where that PC is coming from, I'd have to get a
>> trace.  But I suspect it's actually a GPU address being accessed by some
>> instruction that's failing -- in the past when I've seen this kind of
>> issue, it was happening because the kernel boundary was not being respected
>> and code was running that shouldn't have been running yet.  I don't know
>> what your use case is, so it's possible that is not the issue for you -- a
>> trace would be the only way to know for sure.
>>
>> Matt
>>
>> Regards,
>> Matt Sinclair
>> Assistant Professor
>> University of Wisconsin-Madison
>> Computer Sciences Department
>> cs.wisc.edu/~sinclair
>>
>> On Wed, Jun 17, 2020 at 10:30 AM Daniel Gerzhoy 
>> wrote:
>>
>> Hey Matt,
>>
>> Thanks for pushing those changes. I updated the head of the amd staging
>> branch and tried to run square. The time to get into main stays about the
>> same (5min) FYI.
>>
>> But the hipDeviceSynchronize() fails even when I add 
>> hipSetDeviceFlags(hipDeviceScheduleSpin);
>> unfortunately.
>>
>>  panic: Tried to read unmapped address 0x1853e78.
>> PC: 0x752a966b, Instr:   MOV_R_M : ld   rdi, DS:[rbx + 0x8]
>>
>> Is that PC (0x752a966b) somewhere in the HIP code or something?
>> Or in the emulated driver? The line between the simulator code and
>> guest code is kind of blurry to me around there haha.
>>
>> Best,
>>
>> Dan
>>
>> On Mon, Jun 15, 2020 at 9:59 PM Matt Sinclair 
>> wrote:
>>
>> Hi Dan,
>>
>> Thanks for the update.  Apologies for the delay, the patch didn't apply
>> cleanly initially, but I have pushed the patch I promised previously.

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-22 Thread Matt Sinclair via gem5-users
Hi Dan,

Do you mind pushing your workaround with the completion signal as a patch
to the staging branch so we can take a look?  Or is this just a change to
the program(s) itself?

After Kyle's fix (which has been pushed as an update to my patch), we're
still seeing some hipDeviceSynchronize failures.  So we're interested in
looking at what you did to see if it solves the problem.

Matt

On Fri, Jun 19, 2020 at 4:15 PM Kyle Roarty  wrote:

> Hi Dan,
>
> Another thing to try is to add and set the environment variable HIP_DB in
> apu_se.py (Line 461 for me, starts with "env = ['LD_LIBRARY_PATH...").
> Setting HIP_DB=sync or HIP_DB=api has prevented crashing on
> hipDeviceSynchronize() calls for the applications I've tested.
>
> I had traced through this issue (or at least one that manifests the same
> way) a while back, and what I remember is that the crash happens somewhere
> in the HIP code, and it occurs because somewhere much earlier we go down a
> codepath that doesn't clear a register (I believe that was also in HIP
> code). That register then gets re-used until the error propagates to the
> register used in the ld instruction. Unfortunately, I had a hard time of
> getting consistent, manageable traces, so I wasn't able to figure out why
> we were going down the wrong codepath.
>
> Kyle
> --
> *From:* mattdsincl...@gmail.com 
> *Sent:* Friday, June 19, 2020 2:08 PM
> *To:* Daniel Gerzhoy 
> *Cc:* GAURAV JAIN ; Kyle Roarty ; gem5
> users mailing list 
> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>
> Thanks Dan.  Kyle R. has found some things about the patch that we're
> testing, and fixes may need to be pushed pending those results.  Fingers crossed
> that fix will help you too.
>
> As Gaurav mentioned previously, the spin flag did not always solve the
> problem for us -- seems like that is true for you too, although I don't
> remember square ever failing for us.
>
> I don't know exactly where that PC is coming from, I'd have to get a
> trace.  But I suspect it's actually a GPU address being accessed by some
> instruction that's failing -- in the past when I've seen this kind of
> issue, it was happening because the kernel boundary was not being respected
> and code was running that shouldn't have been running yet.  I don't know
> what your use case is, so it's possible that is not the issue for you -- a
> trace would be the only way to know for sure.
>
> Matt
>
> Regards,
> Matt Sinclair
> Assistant Professor
> University of Wisconsin-Madison
> Computer Sciences Department
> cs.wisc.edu/~sinclair
>
> On Wed, Jun 17, 2020 at 10:30 AM Daniel Gerzhoy 
> wrote:
>
> Hey Matt,
>
> Thanks for pushing those changes. I updated the head of the amd staging
> branch and tried to run square. The time to get into main stays about the
> same (5min) FYI.
>
> But the hipDeviceSynchronize() fails even when I add 
> hipSetDeviceFlags(hipDeviceScheduleSpin);
> unfortunately.
>
>  panic: Tried to read unmapped address 0x1853e78.
> PC: 0x752a966b, Instr:   MOV_R_M : ld   rdi, DS:[rbx + 0x8]
>
> Is that PC (0x752a966b) somewhere in the HIP code or something?
> Or in the emulated driver? The line between the simulator code and
> guest code is kind of blurry to me around there haha.
>
> Best,
>
> Dan
>
> On Mon, Jun 15, 2020 at 9:59 PM Matt Sinclair 
> wrote:
>
> Hi Dan,
>
> Thanks for the update.  Apologies for the delay, the patch didn't apply
> cleanly initially, but I have pushed the patch I promised previously.
> Since I'm not sure if you're on the develop branch or the AMD staging
> branch, I pushed it to both (there are some differences in code on the
> branches, which I hope will be resolved over time as more of the commits
> from the staging branch are pushed to develop):
>
> - develop: https://gem5-review.googlesource.com/c/public/gem5/+/30354
> - AMD staging: https://gem5-review.googlesource.com/c/amd/gem5/+/30335
>
> I have validated that both of them compile, and asked Kyle R to test that
> both of them a) don't break anything that is expected to work publicly with
> the GPU and b) hopefully resolve some of the problems (like yours) with
> barrier synchronization.  Let us know if this solves your problem too --
> fingers crossed.
>
> Thanks,
> Matt
>
> On Fri, Jun 12, 2020 at 2:47 PM Daniel Gerzhoy 
> wrote:
>
>   Matt,
>
> It wasn't so much a solution as an explanation. Kyle was running on an r5
> 3600 (3.6-4.2 GHz) whereas I am on a Xeon Gold 5117 @ (2.0 - 2.8 GHz)
>
> The relative difference in clock speed seems to me to be a more reasonable
> explanation for a slowdown from 1-1.5 minutes to ~5min (actual time before
> main) than the 8 min (time before main + exit time) I was seeing before.
>
> I'll update to the latest branch and see if that speeds me up further. I'm
> also going to try running on a faster machine as well though that will take
> some setup-time.
>
> Gaurav,
>
> Thanks for the tip, that will be helpful in the meantime.
>
> Dan
>
> On Fri, Jun 12, 2020 at 3:41 PM GAURAV JAIN wrote:

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-19 Thread Matt Sinclair via gem5-users
Thanks Dan.  Kyle R. has found some things about the patch that we're
testing, and fixes may need to be pushed pending those results.  Fingers crossed
that fix will help you too.

As Gaurav mentioned previously, the spin flag did not always solve the
problem for us -- seems like that is true for you too, although I don't
remember square ever failing for us.

I don't know exactly where that PC is coming from, I'd have to get a
trace.  But I suspect it's actually a GPU address being accessed by some
instruction that's failing -- in the past when I've seen this kind of
issue, it was happening because the kernel boundary was not being respected
and code was running that shouldn't have been running yet.  I don't know
what your use case is, so it's possible that is not the issue for you -- a
trace would be the only way to know for sure.

Matt

Regards,
Matt Sinclair
Assistant Professor
University of Wisconsin-Madison
Computer Sciences Department
cs.wisc.edu/~sinclair

On Wed, Jun 17, 2020 at 10:30 AM Daniel Gerzhoy 
wrote:

> Hey Matt,
>
> Thanks for pushing those changes. I updated the head of the amd staging
> branch and tried to run square. The time to get into main stays about the
> same (5min) FYI.
>
> But the hipDeviceSynchronize() fails even when I add 
> hipSetDeviceFlags(hipDeviceScheduleSpin);
> unfortunately.
>
>  panic: Tried to read unmapped address 0x1853e78.
> PC: 0x752a966b, Instr:   MOV_R_M : ld   rdi, DS:[rbx + 0x8]
>
> Is that PC (0x752a966b) somewhere in the HIP code or something?
> Or in the emulated driver? The line between the simulator code and
> guest code is kind of blurry to me around there haha.
>
> Best,
>
> Dan
>
> On Mon, Jun 15, 2020 at 9:59 PM Matt Sinclair 
> wrote:
>
>> Hi Dan,
>>
>> Thanks for the update.  Apologies for the delay, the patch didn't apply
>> cleanly initially, but I have pushed the patch I promised previously.
>> Since I'm not sure if you're on the develop branch or the AMD staging
>> branch, I pushed it to both (there are some differences in code on the
>> branches, which I hope will be resolved over time as more of the commits
>> from the staging branch are pushed to develop):
>>
>> - develop: https://gem5-review.googlesource.com/c/public/gem5/+/30354
>> - AMD staging: https://gem5-review.googlesource.com/c/amd/gem5/+/30335
>>
>> I have validated that both of them compile, and asked Kyle R to test that
>> both of them a) don't break anything that is expected to work publicly with
>> the GPU and b) hopefully resolve some of the problems (like yours) with
>> barrier synchronization.  Let us know if this solves your problem too --
>> fingers crossed.
>>
>> Thanks,
>> Matt
>>
>> On Fri, Jun 12, 2020 at 2:47 PM Daniel Gerzhoy 
>> wrote:
>>
>>>   Matt,
>>>
>>> It wasn't so much a solution as an explanation. Kyle was running on an
>>> r5 3600 (3.6-4.2 GHz) whereas I am on a Xeon Gold 5117 @ (2.0 - 2.8 GHz)
>>>
>>> The relative difference in clock speed seems to me to be a more
>>> reasonable explanation for a slowdown from 1-1.5 minutes to ~5min (actual
>>> time before main) than the 8 min (time before main + exit time) I was seeing
>>> before.
>>>
>>> I'll update to the latest branch and see if that speeds me up further.
>>> I'm also going to try running on a faster machine as well though that will
>>> take some setup-time.
>>>
>>> Gaurav,
>>>
>>> Thanks for the tip, that will be helpful in the meantime.
>>>
>>> Dan
>>>
>>> On Fri, Jun 12, 2020 at 3:41 PM GAURAV JAIN  wrote:
>>>
 Hi,

 I am not sure if chiming in now would cause any more confusion, but
 still giving it a try.

 @Daniel Gerzhoy  - for hipDeviceSynchronize,
 as Matt mentioned, they are working on a fix and should have it out there.
 If you want to, can you try this:

 hipSetDeviceFlags(hipDeviceScheduleSpin);
 for (int k = 1; k < dim; k++) {
     hipLaunchKernelGGL(HIP_KERNEL_NAME(somekernel), grid, threads, 0, 0);
     hipDeviceSynchronize();
 }

 For me, in many cases (not all; in the ones where it didn't work, I
 got the same unmapped-address error as you), this seemed to do the
 trick. You should check out the HEAD and then try this. I'm not expecting
 it to make any difference, but it's still worth a shot.


 --
 *From:* mattdsincl...@gmail.com 
 *Sent:* Friday, June 12, 2020 2:14 PM
 *To:* Daniel Gerzhoy 
 *Cc:* Kyle Roarty ; GAURAV JAIN ;
 gem5 users mailing list 
 *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time

 Hi Dan,

 Glad to hear things are working, and thanks for the tips!  I must admit
 to not quite following what the solution was though -- are you saying the
 solution is to replace exit(0)/return with m5_exit()?  I thought your
 original post said the problem was things taking a really long time before
 main?  If so, it would seem like something else must have been the

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-17 Thread Daniel Gerzhoy via gem5-users
Hey Matt,

Thanks for pushing those changes. I updated the head of the amd staging
branch and tried to run square. The time to get into main stays about the
same (5min) FYI.

But the hipDeviceSynchronize() fails even when I add
hipSetDeviceFlags(hipDeviceScheduleSpin);
unfortunately.

 panic: Tried to read unmapped address 0x1853e78.
PC: 0x752a966b, Instr:   MOV_R_M : ld   rdi, DS:[rbx + 0x8]

Is that PC (0x752a966b) somewhere in the HIP code or something?
Or in the emulated driver? The line between the simulator code and
guest code is kind of blurry to me around there haha.

Best,

Dan

On Mon, Jun 15, 2020 at 9:59 PM Matt Sinclair 
wrote:

> Hi Dan,
>
> Thanks for the update.  Apologies for the delay, the patch didn't apply
> cleanly initially, but I have pushed the patch I promised previously.
> Since I'm not sure if you're on the develop branch or the AMD staging
> branch, I pushed it to both (there are some differences in code on the
> branches, which I hope will be resolved over time as more of the commits
> from the staging branch are pushed to develop):
>
> - develop: https://gem5-review.googlesource.com/c/public/gem5/+/30354
> - AMD staging: https://gem5-review.googlesource.com/c/amd/gem5/+/30335
>
> I have validated that both of them compile, and asked Kyle R to test that
> both of them a) don't break anything that is expected to work publicly with
> the GPU and b) hopefully resolve some of the problems (like yours) with
> barrier synchronization.  Let us know if this solves your problem too --
> fingers crossed.
>
> Thanks,
> Matt
>
> On Fri, Jun 12, 2020 at 2:47 PM Daniel Gerzhoy 
> wrote:
>
>>   Matt,
>>
>> It wasn't so much a solution as an explanation. Kyle was running on an r5
>> 3600 (3.6-4.2 GHz) whereas I am on a Xeon Gold 5117 @ (2.0 - 2.8 GHz)
>>
>> The relative difference in clock speed seems to me to be a more
>> reasonable explanation for a slowdown from 1-1.5 minutes to ~5min (actual
>> time before main) than the 8 min (time before main + exit time) I was seeing
>> before.
>>
>> I'll update to the latest branch and see if that speeds me up further.
>> I'm also going to try running on a faster machine as well though that will
>> take some setup-time.
>>
>> Gaurav,
>>
>> Thanks for the tip, that will be helpful in the meantime.
>>
>> Dan
>>
>> On Fri, Jun 12, 2020 at 3:41 PM GAURAV JAIN  wrote:
>>
>>> Hi,
>>>
>>> I am not sure if chiming in now would cause any more confusion, but
>>> still giving it a try.
>>>
>>> @Daniel Gerzhoy  - for hipDeviceSynchronize,
>>> as Matt mentioned, they are working on a fix and should have it out there.
>>> If you want to, can you try this:
>>>
>>> hipSetDeviceFlags(hipDeviceScheduleSpin);
>>> for (int k = 1; k < dim; k++) {
>>>     hipLaunchKernelGGL(HIP_KERNEL_NAME(somekernel), grid, threads, 0, 0);
>>>     hipDeviceSynchronize();
>>> }
>>>
>>> For me, in many cases (not all; in the ones where it didn't work, I
>>> got the same unmapped-address error as you), this seemed to do the
>>> trick. You should check out the HEAD and then try this. I'm not expecting
>>> it to make any difference, but it's still worth a shot.
>>>
>>>
>>> --
>>> *From:* mattdsincl...@gmail.com 
>>> *Sent:* Friday, June 12, 2020 2:14 PM
>>> *To:* Daniel Gerzhoy 
>>> *Cc:* Kyle Roarty ; GAURAV JAIN ;
>>> gem5 users mailing list 
>>> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>>>
>>> Hi Dan,
>>>
>>> Glad to hear things are working, and thanks for the tips!  I must admit
>>> to not quite following what the solution was though -- are you saying the
>>> solution is to replace exit(0)/return with m5_exit()?  I thought your
>>> original post said the problem was things taking a really long time before
>>> main?  If so, it would seem like something else must have been the
>>> problem/solution?
>>>
>>> Coming to your other questions: I don't recall what exactly the root
>>> cause of the hipDeviceSynchronize failure is, but I would definitely
>>> recommend updating to the current staging branch head first and testing.  I
>>> am also hoping to push a fix today to the barrier bit synchronization --
>>> most of the hipDeviceSynchronize-type failures I've seen were due to a bug
>>> in my barrier bit implementation.  I'm not sure if this will be the
>>> solution to your problem or not, but I can definitely add you as a reviewer
>>> and/or point you to it if needed.
>>>
>>> Not sure about the m5op, hopefully someone else can chime in on that.
>>>
>>> Thanks,
>>> Matt
>>>
>>> On Fri, Jun 12, 2020 at 12:12 PM Daniel Gerzhoy <
>>> daniel.gerz...@gmail.com> wrote:
>>>
>>> I've figured it out.
>>>
>>> To measure the time it took to get to main() I put a return 0; at the
>>> beginning of the function so I wouldn't have to babysit it.
>>>
>>> I didn't consider that it would also take some time for the simulator to
>>> exit, which is where the extra few minutes comes from.
>>> Side-note: m5_exit(0); instead of a return exits immediately.

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-15 Thread Matt Sinclair via gem5-users
Hi Dan,

Thanks for the update.  Apologies for the delay, the patch didn't apply
cleanly initially, but I have pushed the patch I promised previously.
Since I'm not sure if you're on the develop branch or the AMD staging
branch, I pushed it to both (there are some differences in code on the
branches, which I hope will be resolved over time as more of the commits
from the staging branch are pushed to develop):

- develop: https://gem5-review.googlesource.com/c/public/gem5/+/30354
- AMD staging: https://gem5-review.googlesource.com/c/amd/gem5/+/30335

I have validated that both of them compile, and asked Kyle R to test that
both of them a) don't break anything that is expected to work publicly with
the GPU and b) hopefully resolve some of the problems (like yours) with
barrier synchronization.  Let us know if this solves your problem too --
fingers crossed.

Thanks,
Matt

On Fri, Jun 12, 2020 at 2:47 PM Daniel Gerzhoy 
wrote:

>   Matt,
>
> It wasn't so much a solution as an explanation. Kyle was running on an r5
> 3600 (3.6-4.2 GHz) whereas I am on a Xeon Gold 5117 @ (2.0 - 2.8 GHz)
>
> The relative difference in clock speed seems to me to be a more reasonable
> explanation for a slowdown from 1-1.5 minutes to ~5min (actual time before
> main) than the 8 min (time before main + exit time) I was seeing before.
>
> I'll update to the latest branch and see if that speeds me up further. I'm
> also going to try running on a faster machine as well though that will take
> some setup-time.
>
> Gaurav,
>
> Thanks for the tip, that will be helpful in the meantime.
>
> Dan
>
> On Fri, Jun 12, 2020 at 3:41 PM GAURAV JAIN  wrote:
>
>> Hi,
>>
>> I am not sure if chiming in now would cause any more confusion, but still
>> giving it a try.
>>
>> @Daniel Gerzhoy  - for hipDeviceSynchronize,
>> as Matt mentioned, they are working on a fix and should have it out there.
>> If you want to, can you try this:
>>
>> hipSetDeviceFlags(hipDeviceScheduleSpin);
>> for (int k = 1; k < dim; k++) {
>>     hipLaunchKernelGGL(HIP_KERNEL_NAME(somekernel), grid, threads, 0, 0);
>>     hipDeviceSynchronize();
>> }
>>
>> For me, in many cases (not all; in the ones where it didn't work, I
>> got the same unmapped-address error as you), this seemed to do the
>> trick. You should check out the HEAD and then try this. I'm not expecting
>> it to make any difference, but it's still worth a shot.
>>
>>
>> --
>> *From:* mattdsincl...@gmail.com 
>> *Sent:* Friday, June 12, 2020 2:14 PM
>> *To:* Daniel Gerzhoy 
>> *Cc:* Kyle Roarty ; GAURAV JAIN ;
>> gem5 users mailing list 
>> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>>
>> Hi Dan,
>>
>> Glad to hear things are working, and thanks for the tips!  I must admit
>> to not quite following what the solution was though -- are you saying the
>> solution is to replace exit(0)/return with m5_exit()?  I thought your
>> original post said the problem was things taking a really long time before
>> main?  If so, it would seem like something else must have been the
>> problem/solution?
>>
>> Coming to your other questions: I don't recall what exactly the root
>> cause of the hipDeviceSynchronize failure is, but I would definitely
>> recommend updating to the current staging branch head first and testing.  I
>> am also hoping to push a fix today to the barrier bit synchronization --
>> most of the hipDeviceSynchronize-type failures I've seen were due to a bug
>> in my barrier bit implementation.  I'm not sure if this will be the
>> solution to your problem or not, but I can definitely add you as a reviewer
>> and/or point you to it if needed.
>>
>> Not sure about the m5op, hopefully someone else can chime in on that.
>>
>> Thanks,
>> Matt
>>
>> On Fri, Jun 12, 2020 at 12:12 PM Daniel Gerzhoy 
>> wrote:
>>
>> I've figured it out.
>>
>> To measure the time it took to get to main() I put a return 0; at the
>> beginning of the function so I wouldn't have to babysit it.
>>
>> I didn't consider that it would also take some time for the simulator to
>> exit, which is where the extra few minutes comes from.
>> Side-note: m5_exit(0); instead of a return exits immediately.
>>
>> 5 min is a bit more reasonable of a slowdown for the difference between
>> the two clocks.
>>
>> Two incidental things:
>>
>> 1. Is there a way to have gem5 spit out (real wall-clock) timestamps
>> while it's printing stuff?
>> 2. A while ago I asked about hipDeviceSynchronize(); causing crashes
>> (panic: Tried to read unmapped address 0xffc29f48.). Has this been
>> fixed since?
>>
>> I'm going to update to the head of this branch soon, and eventually to
>> the main branch. If it hasn't been fixed I've created a workaround by
>> stealing the completion signal of the kernel based on its launch id, and
>> manually waiting for it using the HSA interface.
>> Happy to help out and implement this as a m5op (or something) if that
>> would be helpful for you guys.
>>
>> Best,
>>
>> Dan

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-12 Thread Daniel Gerzhoy via gem5-users
  Matt,

It wasn't so much a solution as an explanation. Kyle was running on an r5
3600 (3.6-4.2 GHz) whereas I am on a Xeon Gold 5117 @ (2.0 - 2.8 GHz)

The relative difference in clock speed seems to me to be a more reasonable
explanation for a slowdown from 1-1.5 minutes to ~5min (actual time before
main) than the 8 min (time before main + exit time) I was seeing before.

I'll update to the latest branch and see if that speeds me up further. I'm
also going to try running on a faster machine as well though that will take
some setup-time.

Gaurav,

Thanks for the tip, that will be helpful in the meantime.

Dan

On Fri, Jun 12, 2020 at 3:41 PM GAURAV JAIN  wrote:

> Hi,
>
> I am not sure if chiming in now would cause any more confusion, but still
> giving it a try.
>
> @Daniel Gerzhoy  - for hipDeviceSynchronize, as
> Matt mentioned, they are working on a fix and should have it out there. If
> you want to, can you try this:
>
> hipSetDeviceFlags(hipDeviceScheduleSpin);
> for (int k = 1; k < dim; k++) {
>     hipLaunchKernelGGL(HIP_KERNEL_NAME(somekernel), grid, threads, 0, 0);
>     hipDeviceSynchronize();
> }
>
> For me, in many cases (not all; in the ones where it didn't work, I got
> the same unmapped-address error as you), this seemed to do the trick.
> You should check out the HEAD and then try this. I'm not expecting it to
> make any difference, but it's still worth a shot.
>
>
> --
> *From:* mattdsincl...@gmail.com 
> *Sent:* Friday, June 12, 2020 2:14 PM
> *To:* Daniel Gerzhoy 
> *Cc:* Kyle Roarty ; GAURAV JAIN ; gem5
> users mailing list 
> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>
> Hi Dan,
>
> Glad to hear things are working, and thanks for the tips!  I must admit to
> not quite following what the solution was though -- are you saying the
> solution is to replace exit(0)/return with m5_exit()?  I thought your
> original post said the problem was things taking a really long time before
> main?  If so, it would seem like something else must have been the
> problem/solution?
>
> Coming to your other questions: I don't recall what exactly the root cause
> of the hipDeviceSynchronize failure is, but I would definitely recommend
> updating to the current staging branch head first and testing.  I am also
> hoping to push a fix today to the barrier bit synchronization -- most of
> the hipDeviceSynchronize-type failures I've seen were due to a bug in my
> barrier bit implementation.  I'm not sure if this will be the solution to
> your problem or not, but I can definitely add you as a reviewer and/or
> point you to it if needed.
>
> Not sure about the m5op, hopefully someone else can chime in on that.
>
> Thanks,
> Matt
>
> On Fri, Jun 12, 2020 at 12:12 PM Daniel Gerzhoy 
> wrote:
>
> I've figured it out.
>
> To measure the time it took to get to main() I put a return 0; at the
> beginning of the function so I wouldn't have to babysit it.
>
> I didn't consider that it would also take some time for the simulator to
> exit, which is where the extra few minutes comes from.
> Side-note: m5_exit(0); instead of a return exits immediately.
>
> 5 min is a bit more reasonable of a slowdown for the difference between
> the two clocks.
>
> Two incidental things:
>
> 1. Is there a way to have gem5 spit out (real wall-clock) timestamps while
> it's printing stuff?
> 2. A while ago I asked about hipDeviceSynchronize(); causing crashes
> (panic: Tried to read unmapped address 0xffc29f48.). Has this been
> fixed since?
>
> I'm going to update to the head of this branch soon, and eventually to the
> main branch. If it hasn't been fixed I've created a workaround by stealing
> the completion signal of the kernel based on its launch id, and manually
> waiting for it using the HSA interface.
> Happy to help out and implement this as a m5op (or something) if that
> would be helpful for you guys.
>
> Best,
>
> Dan
>
> On Thu, Jun 11, 2020 at 12:40 PM Matt Sinclair 
> wrote:
>
> I don't see anything amazingly amiss in your output, but the number of
> times the open/etc. fail is interesting -- Kyle do we see the same thing?
> If not, it could be that you should update your apu_se.py to point to the
> "correct" place to search for the libraries first?
>
> Also, based on Kyle's reply, Dan how long does it take you to boot up
> square?  Certainly a slower machine might take longer, but it does seem
> even slower than expected.  But if we're trying the same application, maybe
> it will be easier to spot differences.
>
> I would also recommend updating to the latest commit on the staging branch
> -- I don't believe it should break anything with those patches.
>
> Yes, looks like you are using the release version of ROCm -- no issues
> there.
>
> Matt
>
>
>
> On Thu, Jun 11, 2020 at 9:38 AM Daniel Gerzhoy 
> wrote:
>
> I am using the docker, yeah.
> It's running on our server cluster which is a Xeon Gold 5117 @ (2.0 - 2.8
> GHz) which might make up some of the difference, the r5 3600 has a faster
> clock (3.6-4.2 GHz).

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-12 Thread GAURAV JAIN via gem5-users
Hi,

I am not sure if chiming in now would cause any more confusion, but still 
giving it a try.

@Daniel Gerzhoy - for hipDeviceSynchronize, as 
Matt mentioned, they are working on a fix and should have it out there. If you 
want to, can you try this:

hipSetDeviceFlags(hipDeviceScheduleSpin);
for (int k = 1; k < dim; k++) {
    hipLaunchKernelGGL(HIP_KERNEL_NAME(somekernel), grid, threads, 0, 0);
    hipDeviceSynchronize();
}

For me, in many cases (not all; in the ones where it didn't work, I got the
same unmapped-address error as you), this seemed to do the trick. You should
check out the HEAD and then try this. I'm not expecting it to make any
difference, but it's still worth a shot.



From: mattdsincl...@gmail.com 
Sent: Friday, June 12, 2020 2:14 PM
To: Daniel Gerzhoy 
Cc: Kyle Roarty ; GAURAV JAIN ; gem5 users 
mailing list 
Subject: Re: [gem5-users] GCN3 GPU Simulation Start-Up Time

Hi Dan,

Glad to hear things are working, and thanks for the tips!  I must admit to not 
quite following what the solution was though -- are you saying the solution is 
to replace exit(0)/return with m5_exit()?  I thought your original post said 
the problem was things taking a really long time before main?  If so, it would 
seem like something else must have been the problem/solution?

Coming to your other questions: I don't recall what exactly the root cause of 
the hipDeviceSynchronize failure is, but I would definitely recommend updating 
to the current staging branch head first and testing.  I am also hoping to push 
a fix today to the barrier bit synchronization -- most of the 
hipDeviceSynchronize-type failures I've seen were due to a bug in my barrier 
bit implementation.  I'm not sure if this will be the solution to your problem 
or not, but I can definitely add you as a reviewer and/or point you to it if 
needed.

Not sure about the m5op, hopefully someone else can chime in on that.

Thanks,
Matt

On Fri, Jun 12, 2020 at 12:12 PM Daniel Gerzhoy 
mailto:daniel.gerz...@gmail.com>> wrote:
I've figured it out.

To measure the time it took to get to main() I put a return 0; at the beginning 
of the function so I wouldn't have to babysit it.

I didn't consider that it would also take some time for the simulator to exit, 
which is where the extra few minutes comes from.
Side-note: m5_exit(0); instead of a return exits immediately.

5 min is a bit more reasonable of a slowdown for the difference between the two 
clocks.

Two incidental things:

1. Is there a way to have gem5 spit out (real wall-clock) timestamps while it's 
printing stuff?
2. A while ago I asked about hipDeviceSynchronize(); causing crashes (panic: 
Tried to read unmapped address 0xffc29f48.). Has this been fixed since?

I'm going to update to the head of this branch soon, and eventually to the main 
branch. If it hasn't been fixed I've created a workaround by stealing the 
completion signal of the kernel based on its launch id, and manually waiting 
for it using the HSA interface.
Happy to help out and implement this as a m5op (or something) if that would be 
helpful for you guys.

Best,

Dan

On Thu, Jun 11, 2020 at 12:40 PM Matt Sinclair 
mailto:mattdsincl...@gmail.com>> wrote:
I don't see anything amazingly amiss in your output, but the number of times 
the open/etc. fail is interesting -- Kyle do we see the same thing?  If not, it 
could be that you should update your apu_se.py to point to the "correct" place 
to search for the libraries first?

Also, based on Kyle's reply, Dan how long does it take you to boot up square?  
Certainly a slower machine might take longer, but it does seem even slower than 
expected.  But if we're trying the same application, maybe it will be easier to 
spot differences.

I would also recommend updating to the latest commit on the staging branch -- I 
don't believe it should break anything with those patches.

Yes, looks like you are using the release version of ROCm -- no issues there.

Matt



On Thu, Jun 11, 2020 at 9:38 AM Daniel Gerzhoy 
mailto:daniel.gerz...@gmail.com>> wrote:
I am using the docker, yeah.
It's running on our server cluster which is a Xeon Gold 5117 @ (2.0 - 2.8 GHz) 
which might make up some of the difference, the r5 3600 has a faster clock 
(3.6-4.2 GHz).

I've hesitated to update my branch because in the Dockerfile it specifically 
checks this branch out and applies a patch, though the patch isn't very 
extensive.
This was from a while back (November maybe?) and I know you guys have been 
integrating things into the main branch (thanks!)
I was thinking I would wait until it's fully merged into the mainline gem5 
branch and rebase onto that and try to merge my changes in.

Last I checked the GCN3 stuff is in the dev branch not the master right?

But if it will help maybe I should update to the head of this branch. Will I 
need to update the docker as well?

As for the debug vs release rocm I think I'm using the release version.

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-12 Thread Matt Sinclair via gem5-users
Hi Dan,

Glad to hear things are working, and thanks for the tips!  I must admit to
not quite following what the solution was though -- are you saying the
solution is to replace exit(0)/return with m5_exit()?  I thought your
original post said the problem was things taking a really long time before
main?  If so, it would seem like something else must have been the
problem/solution?

Coming to your other questions: I don't recall what exactly the root cause
of the hipDeviceSynchronize failure is, but I would definitely recommend
updating to the current staging branch head first and testing.  I am also
hoping to push a fix today to the barrier bit synchronization -- most of
the hipDeviceSynchronize-type failures I've seen were due to a bug in my
barrier bit implementation.  I'm not sure if this will be the solution to
your problem or not, but I can definitely add you as a reviewer and/or
point you to it if needed.

Not sure about the m5op, hopefully someone else can chime in on that.

Thanks,
Matt

On Fri, Jun 12, 2020 at 12:12 PM Daniel Gerzhoy 
wrote:

> I've figured it out.
>
> To measure the time it took to get to main() I put a return 0; at the
> beginning of the function so I wouldn't have to babysit it.
>
> I didn't consider that it would also take some time for the simulator to
> exit, which is where the extra few minutes comes from.
> Side-note: m5_exit(0); instead of a return exits immediately.
>
> 5 min is a bit more reasonable of a slowdown for the difference between
> the two clocks.
>
> Two incidental things:
>
> 1. Is there a way to have gem5 spit out (real wall-clock) timestamps while
> it's printing stuff?
> 2. A while ago I asked about hipDeviceSynchronize(); causing crashes
> (panic: Tried to read unmapped address 0xffc29f48.). Has this been
> fixed since?
>
> I'm going to update to the head of this branch soon, and eventually to the
> main branch. If it hasn't been fixed I've created a workaround by stealing
> the completion signal of the kernel based on its launch id, and manually
> waiting for it using the HSA interface.
> Happy to help out and implement this as a m5op (or something) if that
> would be helpful for you guys.
>
> Best,
>
> Dan
>
> On Thu, Jun 11, 2020 at 12:40 PM Matt Sinclair 
> wrote:
>
>> I don't see anything amazingly amiss in your output, but the number of
>> times the open/etc. fail is interesting -- Kyle do we see the same thing?
>> If not, it could be that you should update your apu_se.py to point to the
>> "correct" place to search for the libraries first?
>>
>> Also, based on Kyle's reply, Dan how long does it take you to boot up
>> square?  Certainly a slower machine might take longer, but it does seem
>> even slower than expected.  But if we're trying the same application, maybe
>> it will be easier to spot differences.
>>
>> I would also recommend updating to the latest commit on the staging
>> branch -- I don't believe it should break anything with those patches.
>>
>> Yes, looks like you are using the release version of ROCm -- no issues
>> there.
>>
>> Matt
>>
>>
>>
>> On Thu, Jun 11, 2020 at 9:38 AM Daniel Gerzhoy 
>> wrote:
>>
>>> I am using the docker, yeah.
>>> It's running on our server cluster which is a Xeon Gold 5117 @ (2.0 -
>>> 2.8 GHz) which might make up some of the difference, the r5 3600 has a
>>> faster clock (3.6-4.2 GHz).
>>>
>>> I've hesitated to update my branch because in the Dockerfile it
>>> specifically checks this branch out and applies a patch, though the patch
>>> isn't very extensive.
>>> This was from a while back (November maybe?) and I know you guys have
>>> been integrating things into the main branch (thanks!)
>>> I was thinking I would wait until it's fully merged into the mainline
>>> gem5 branch and rebase onto that and try to merge my changes in.
>>>
>>> Last I checked the GCN3 stuff is in the dev branch not the master right?
>>>
>>> But if it will help maybe I should update to the head of this branch.
>>> Will I need to update the docker as well?
>>>
>>> As for the debug vs release rocm I think I'm using the release version.
>>> This is what the dockerfile built:
>>>
>>> ARG rocm_ver=1.6.2
>>> RUN wget -qO- repo.radeon.com/rocm/archive/apt_${rocm_ver}.tar.bz2 \
>>> | tar -xjv \
>>> && cd apt_${rocm_ver}/pool/main/ \
>>> && dpkg -i h/hsakmt-roct-dev/* \
>>> && dpkg -i h/hsa-ext-rocr-dev/* \
>>> && dpkg -i h/hsa-rocr-dev/* \
>>> && dpkg -i r/rocm-utils/* \
>>> && dpkg -i h/hcc/* \
>>> && dpkg -i h/hip_base/* \
>>> && dpkg -i h/hip_hcc/* \
>>> && dpkg -i h/hip_samples/*
>>>
>>>
>>> I ran a benchmark that prints that it entered main and returns
>>> immediately; this took 9 minutes.
>>> I've attached a debug trace with debug flags =
>>> "GPUDriver,SyscallVerbose"
>>> There are a lot of weird things going on: "syscall open: failed", "syscall
>>> brk: break point changed to [...]", and lots of ignored system calls.

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-12 Thread Daniel Gerzhoy via gem5-users
I've figured it out.

To measure the time it took to get to main() I put a return 0; at the
beginning of the function so I wouldn't have to babysit it.

I didn't consider that it would also take some time for the simulator to
exit, which is where the extra few minutes comes from.
Side-note: m5_exit(0); instead of a return exits immediately.
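
For reference, a minimal sketch of the m5_exit() pattern, using gem5's m5ops
library from util/m5 (the header path and link step vary by gem5 version, so
treat the details as assumptions rather than exact build instructions):

// Include gem5's m5ops header and link against the libm5 built in util/m5.
#include <gem5/m5ops.h>

int main()
{
    // ... benchmark work elided ...

    // Exit the simulation directly instead of returning through the guest
    // C runtime, which is where the extra teardown minutes were going.
    m5_exit(0); // the argument is a delay in nanoseconds before exiting
    return 0;   // not reached under gem5
}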

5 min is a bit more reasonable of a slowdown for the difference between the
two clocks.

Two incidental things:

1. Is there a way to have gem5 spit out (real wall-clock) timestamps while
it's printing stuff?
2. A while ago I asked about hipDeviceSynchronize(); causing crashes
(panic: Tried to read unmapped address 0xffc29f48.). Has this been
fixed since?

I'm going to update to the head of this branch soon, and eventually to the
main branch. If it hasn't been fixed I've created a workaround by stealing
the completion signal of the kernel based on its launch id, and manually
waiting for it using the HSA interface.
Happy to help out and implement this as a m5op (or something) if that would
be helpful for you guys.

Best,

Dan

On Thu, Jun 11, 2020 at 12:40 PM Matt Sinclair 
wrote:

> I don't see anything amazingly amiss in your output, but the number of
> times the open/etc. fail is interesting -- Kyle do we see the same thing?
> If not, it could be that you should update your apu_se.py to point to the
> "correct" place to search for the libraries first?
>
> Also, based on Kyle's reply, Dan how long does it take you to boot up
> square?  Certainly a slower machine might take longer, but it does seem
> even slower than expected.  But if we're trying the same application, maybe
> it will be easier to spot differences.
>
> I would also recommend updating to the latest commit on the staging branch
> -- I don't believe it should break anything with those patches.
>
> Yes, looks like you are using the release version of ROCm -- no issues
> there.
>
> Matt
>
>
>
> On Thu, Jun 11, 2020 at 9:38 AM Daniel Gerzhoy 
> wrote:
>
>> I am using the docker, yeah.
>> It's running on our server cluster which is a Xeon Gold 5117 @ (2.0 - 2.8
>> GHz) which might make up some of the difference, the r5 3600 has a faster
>> clock (3.6-4.2 GHz).
>>
>> I've hesitated to update my branch because in the Dockerfile it
>> specifically checks this branch out and applies a patch, though the patch
>> isn't very extensive.
>> This was from a while back (November maybe?) and I know you guys have
>> been integrating things into the main branch (thanks!)
>> I was thinking I would wait until it's fully merged into the mainline
>> gem5 branch and rebase onto that and try to merge my changes in.
>>
>> Last I checked the GCN3 stuff is in the dev branch not the master right?
>>
>> But if it will help maybe I should update to the head of this branch.
>> Will I need to update the docker as well?
>>
>> As for the debug vs release rocm I think I'm using the release version.
>> This is what the dockerfile built:
>>
>> ARG rocm_ver=1.6.2
>> RUN wget -qO- repo.radeon.com/rocm/archive/apt_${rocm_ver}.tar.bz2 \
>> | tar -xjv \
>> && cd apt_${rocm_ver}/pool/main/ \
>> && dpkg -i h/hsakmt-roct-dev/* \
>> && dpkg -i h/hsa-ext-rocr-dev/* \
>> && dpkg -i h/hsa-rocr-dev/* \
>> && dpkg -i r/rocm-utils/* \
>> && dpkg -i h/hcc/* \
>> && dpkg -i h/hip_base/* \
>> && dpkg -i h/hip_hcc/* \
>> && dpkg -i h/hip_samples/*
>>
>>
>> I ran a benchmark that prints that it entered main and returns
>> immediately; this took 9 minutes.
>> I've attached a debug trace with debug flags = "GPUDriver,SyscallVerbose"
>> There are a lot of weird things going on: "syscall open: failed", "syscall
>> brk: break point changed to [...]", and lots of ignored system calls.
>>
>> head of Stats for reference:
>> -- Begin Simulation Statistics --
>> sim_seconds     0.096192      # Number of seconds simulated
>> sim_ticks       96192368500   # Number of ticks simulated
>> final_tick      96192368500   # Number of ticks from beginning of simulation (restored from checkpoints and never reset)
>> sim_freq        1             # Frequency of simulated ticks
>> host_inst_rate  175209        # Simulator instruction rate (inst/s)
>> host_op_rate    338409        # Simulator op (including micro ops) rate (op/s)
>> host_tick_rate  175362515     # Simulator tick rate (ticks/s)
>> host_mem_usage  1628608       # Number of bytes of host memory used
>> host_seconds    548.53        # Real time elapsed on the host
>> sim_insts       96108256      # Number of instructions simulated
>> sim_ops         185628785     # Number of ops (including micro ops) simulated

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-11 Thread Matt Sinclair via gem5-users
I don't see anything amazingly amiss in your output, but the number of
times the open/etc. fail is interesting -- Kyle do we see the same thing?
If not, it could be that you should update your apu_se.py to point to the
"correct" place to search for the libraries first?

Also, based on Kyle's reply, Dan how long does it take you to boot up
square?  Certainly a slower machine might take longer, but it does seem
even slower than expected.  But if we're trying the same application, maybe
it will be easier to spot differences.

I would also recommend updating to the latest commit on the staging branch
-- I don't believe it should break anything with those patches.

Yes, looks like you are using the release version of ROCm -- no issues
there.

Matt



On Thu, Jun 11, 2020 at 9:38 AM Daniel Gerzhoy 
wrote:

> I am using the docker, yeah.
> It's running on our server cluster which is a Xeon Gold 5117 @ (2.0 - 2.8
> GHz) which might make up some of the difference, the r5 3600 has a faster
> clock (3.6-4.2 GHz).
>
> I've hesitated to update my branch because in the Dockerfile it
> specifically checks this branch out and applies a patch, though the patch
> isn't very extensive.
> This was from a while back (November maybe?) and I know you guys have been
> integrating things into the main branch (thanks!)
> I was thinking I would wait until it's fully merged into the mainline gem5
> branch and rebase onto that and try to merge my changes in.
>
> Last I checked the GCN3 stuff is in the dev branch not the master right?
>
> But if it will help maybe I should update to the head of this branch. Will
> I need to update the docker as well?
>
> As for the debug vs release rocm I think I'm using the release version.
> This is what the dockerfile built:
>
> ARG rocm_ver=1.6.2
> RUN wget -qO- repo.radeon.com/rocm/archive/apt_${rocm_ver}.tar.bz2 \
> | tar -xjv \
> && cd apt_${rocm_ver}/pool/main/ \
> && dpkg -i h/hsakmt-roct-dev/* \
> && dpkg -i h/hsa-ext-rocr-dev/* \
> && dpkg -i h/hsa-rocr-dev/* \
> && dpkg -i r/rocm-utils/* \
> && dpkg -i h/hcc/* \
> && dpkg -i h/hip_base/* \
> && dpkg -i h/hip_hcc/* \
> && dpkg -i h/hip_samples/*
>
>
> I ran a benchmark that prints that it entered main and returns
> immediately; this took 9 minutes.
> I've attached a debug trace with debug flags = "GPUDriver,SyscallVerbose"
> There are a lot of weird things going on: "syscall open: failed", "syscall
> brk: break point changed to [...]", and lots of ignored system calls.
>
> head of Stats for reference:
> -- Begin Simulation Statistics --
> sim_seconds                          0.096192   # Number of seconds simulated
> sim_ticks                         96192368500   # Number of ticks simulated
> final_tick                        96192368500   # Number of ticks from beginning of simulation (restored from checkpoints and never reset)
> sim_freq                                    1   # Frequency of simulated ticks
> host_inst_rate                         175209   # Simulator instruction rate (inst/s)
> host_op_rate                           338409   # Simulator op (including micro ops) rate (op/s)
> host_tick_rate                      175362515   # Simulator tick rate (ticks/s)
> host_mem_usage                        1628608   # Number of bytes of host memory used
> host_seconds                           548.53   # Real time elapsed on the host
> sim_insts                            96108256   # Number of instructions simulated
> sim_ops                             185628785   # Number of ops (including micro ops) simulated
> system.voltage_domain.voltage               1   # Voltage in Volts
> system.clk_domain.clock                  1000   # Clock period in ticks
>
> Maybe something in the attached file explains it better than I can express.
>
> Many thanks for your help and hard work!
>
> Dan
>
>
>
>
>
> On Thu, Jun 11, 2020 at 3:32 AM Kyle Roarty  wrote:
>
>> Running through a few applications using docker on an r5 3600, it took
>> about 2.5 minutes or less each time to start executing the program.
>>
>> I ran square, dynamic_shared, and MatrixTranspose (all from HIP), which
>> took about 1-1.5 mins.
>>
>> I ran conv_bench and rnn_bench from DeepBench which took just about 2
>> minutes.
>>
>> Because of that, it's possible the size of the app has an effect on setup
>> time, as the HIP apps are extremely small.
>>
>> Also, the commit Dan is checked out on is d0945dc (mem-ruby: add cache
>> hit/miss statistics for TCP and TCC), which isn't the most recent commit.
>> I don't believe that would account for such a

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-11 Thread Daniel Gerzhoy via gem5-users
I am using the docker, yeah.
It's running on our server cluster, a Xeon Gold 5117 (2.0-2.8 GHz), which
might account for some of the difference; the r5 3600 has a faster clock
(3.6-4.2 GHz).

I've hesitated to update my branch because the Dockerfile specifically
checks this branch out and applies a patch, though the patch isn't very
extensive.
This was from a while back (November, maybe?), and I know you guys have been
integrating things into the main branch (thanks!).
I was thinking I would wait until it's fully merged into the mainline gem5
branch, then rebase onto that and try to merge my changes in.

Last I checked, the GCN3 stuff is in the dev branch, not master, right?

But if it will help, maybe I should update to the head of this branch. Will
I need to update the docker as well?

As for debug vs. release ROCm, I think I'm using the release version.
This is what the dockerfile built:

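# Fetch the ROCm 1.6.2 apt archive, unpack it, and install the HSA
# runtime, rocm-utils, hcc, and the HIP packages in dependency order: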
ARG rocm_ver=1.6.2
RUN wget -qO- repo.radeon.com/rocm/archive/apt_${rocm_ver}.tar.bz2 \
| tar -xjv \
&& cd apt_${rocm_ver}/pool/main/ \
&& dpkg -i h/hsakmt-roct-dev/* \
&& dpkg -i h/hsa-ext-rocr-dev/* \
&& dpkg -i h/hsa-rocr-dev/* \
&& dpkg -i r/rocm-utils/* \
&& dpkg -i h/hcc/* \
&& dpkg -i h/hip_base/* \
&& dpkg -i h/hip_hcc/* \
&& dpkg -i h/hip_samples/*


I ran a benchmark that prints that it entered main and returns immediately;
this took 9 minutes.
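
For concreteness, the benchmark is essentially the following (a minimal
sketch; I build it with hipcc so the ROCm runtime is still linked in and
the driver setup is exercised -- the file and binary names are arbitrary):

cat > entered_main.cpp <<'EOF'
#include <cstdio>

int main() {
    // the ~9 minutes of start-up elapse before this line prints
    std::printf("entered main\n");
    return 0;
}
EOF
hipcc entered_main.cpp -o entered_main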
I've attached a debug trace with debug flags = "GPUDriver,SyscallVerbose"
There are a lot of weird things going on: "syscall open: failed", "syscall
brk: break point changed to [...]", and lots of ignored system calls.
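
For reference, the trace came from an invocation along these lines (a
sketch -- the option spellings may differ slightly in your checkout, and
the binary name is the toy benchmark above):

build/GCN3_X86/gem5.opt --debug-flags=GPUDriver,SyscallVerbose \
    --debug-file=gpu_trace.out \
    configs/example/apu_se.py -c entered_main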

head of Stats for reference:
-- Begin Simulation Statistics --
sim_seconds                          0.096192   # Number of seconds simulated
sim_ticks                         96192368500   # Number of ticks simulated
final_tick                        96192368500   # Number of ticks from beginning of simulation (restored from checkpoints and never reset)
sim_freq                                    1   # Frequency of simulated ticks
host_inst_rate                         175209   # Simulator instruction rate (inst/s)
host_op_rate                           338409   # Simulator op (including micro ops) rate (op/s)
host_tick_rate                      175362515   # Simulator tick rate (ticks/s)
host_mem_usage                        1628608   # Number of bytes of host memory used
host_seconds                           548.53   # Real time elapsed on the host
sim_insts                            96108256   # Number of instructions simulated
sim_ops                             185628785   # Number of ops (including micro ops) simulated
system.voltage_domain.voltage               1   # Voltage in Volts
system.clk_domain.clock                  1000   # Clock period in ticks
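
(The numbers are at least self-consistent: host_seconds of 548.53 is the
~9 minutes of wall-clock time, and sim_ticks / host_seconds =
96192368500 / 548.53 = ~175.4M, which matches host_tick_rate.)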

Maybe something in the attached file explains it better than I can express.

Many thanks for your help and hard work!

Dan





On Thu, Jun 11, 2020 at 3:32 AM Kyle Roarty  wrote:

> Running through a few applications using docker on an r5 3600, it took
> about 2.5 minutes or less each time to start executing the program.
>
> I ran square, dynamic_shared, and MatrixTranspose (all from HIP), which
> took about 1-1.5 mins.
>
> I ran conv_bench and rnn_bench from DeepBench which took just about 2
> minutes.
>
> Because of that, it's possible the size of the app has an effect on setup
> time, as the HIP apps are extremely small.
>
> Also, the commit Dan is checked out on is d0945dc (mem-ruby: add cache
> hit/miss statistics for TCP and TCC), which isn't the most recent commit.
> I don't believe that would account for such a large slowdown, but it
> doesn't hurt to try the newest commit unless it breaks something.
>
> Kyle
> --
> *From:* mattdsincl...@gmail.com 
> *Sent:* Thursday, June 11, 2020 1:15 AM
> *To:* gem5 users mailing list 
> *Cc:* Daniel Gerzhoy ; GAURAV JAIN <
> gja...@wisc.edu>; Kyle Roarty 
> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>
> Gaurav & Kyle, do you know if this is the case?
>
> Dan, I believe the short answer is yes, although 7-8 minutes seems a
> little long.  Are you running this in Kyle's Docker, or separately?  If in
> the Docker, that adds some overhead, so running it directly on a system
> would likely be somewhat faster.  Also, are you running with the release
> or debug version of the ROCm drivers?  Again, the debug version will
> likely add some time to this.
>
> Matt
>
> On Wed, Jun 10, 2020 at 2:00 PM Daniel Gerzhoy via gem5-users <
> gem5-users@gem5.org> wrote:
>
> I've been running simulations using the GCN3 branch:
>
> rocm_ver=1.6.2
> $ git branch
> * (HEAD 

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-11 Thread Matt Sinclair via gem5-users
Gaurav & Kyle, do you know if this is the case?

Dan, I believe the short answer is yes, although 7-8 minutes seems a little
long.  Are you running this in Kyle's Docker, or separately?  If in the
Docker, that adds some overhead, so running it directly on a system would
likely be somewhat faster.  Also, are you running with the release or debug
version of the ROCm drivers?  Again, the debug version will likely add some
time to this.
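
(One quick way to check, assuming the apt-based install from the
Dockerfile: run "dpkg -l | grep -i rocm" inside the container and look for
debug variants of the packages.)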

Matt

On Wed, Jun 10, 2020 at 2:00 PM Daniel Gerzhoy via gem5-users <
gem5-users@gem5.org> wrote:

> I've been running simulations using the GCN3 branch:
>
> rocm_ver=1.6.2
> $ git branch
> * (HEAD detached at d0945dc)
>   agutierr/master-gcn3-staging
>
> And I've noticed that it takes roughly 7-8 minutes to get to main()
>
> I'm guessing that this is the simulator setting up drivers?
> Is that correct? Is there other stuff going on?
>
> *Has anyone found a way to speed this up?*
>
> I am trying to get some of the rodinia benchmarks from the HIP-Examples
> running, and debugging takes a long time as a result.
>
> I suspect this is unavoidable, but I won't know if I don't ask!
>
> Cheers,
>
> Dan Gerzhoy
___
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org