[gem5-users] Re: Debugging Gem5 Segmentation Fault in x86 Decoder

2020-06-11 Thread Saileshwar, Gururaj via gem5-users
Thank you Gabe.

I have created a bug report for the stack corruption at this link:
https://gem5.atlassian.net/projects/GEM5/issues/GEM5-631

On Jira, I have also uploaded the gem5.opt binary compiled without tcmalloc, 
and a sample benchmark where the seg-fault occurs, with instructions to run it. 
For me, the seg-fault occurs within about 30 seconds of the real simulation 
starting with the O3CPU.

If you or anyone else can help figure out the cause of the memory corruption, 
that would be great!

Thanks,
Gururaj




From: Gabe Black
Sent: Wednesday, June 10, 2020 11:47 PM
To: Saileshwar, Gururaj; gem5 users mailing list; Brad Beckmann; Tony Gutierrez

Subject: Re: [gem5-users] Re: Debugging Gem5 Segmentation Fault in x86 Decoder

Actually, could you file a bug for this over on Jira?

https://gem5.atlassian.net/secure/BrowseProjects.jspa

I'm not sure what limits it has on file size, etc., but that might be a good 
place to upload the binary you're trying to run.

Gabe

On Wed, Jun 10, 2020 at 8:41 PM Gabe Black <gabebl...@google.com> wrote:
cc-ing a couple AMD folks in case they have some input or want to know that 
there's a potential bug here.

Ignoring the last error, which I addressed in a different email, this looks like 
some sort of (gem5) stack corruption. Note that this doesn't directly have 
anything to do with the stack in the simulated system getting corrupted. My 
best guess is that a temporary EmulEnv is being constructed either directly on 
the stack, or in some automatically allocated heap memory whose pointer is on 
the stack. Then the constructor runs, something bad happens, and when the 
function attempts to return and clean up the temporary, the pointer/object is 
corrupted and gem5 eats itself a little bit.
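
As a contrived illustration of that failure pattern (a sketch only, not gem5's 
actual code), something like this leaves a local object corrupted so the 
cleanup on return goes wrong:

// Sketch only -- not gem5 code. A stack buffer overrun clobbers an
// adjacent local object, so the cleanup on return misbehaves.
#include <cstring>

struct Env {                   // stand-in for a temporary like EmulEnv
    char *data;
    Env() : data(new char[16]) {}
    ~Env() { delete[] data; }  // "Invalid free()" if data was clobbered
};

void victim(const char *input)
{
    Env env;                   // temporary living on the (gem5) stack
    char buf[8];
    std::strcpy(buf, input);   // overrun can overwrite env.data
}                              // ~Env() then frees a garbage pointer

int main()
{
    victim("definitely-more-than-eight-bytes");
    return 0;
}

Whether env actually gets hit depends on stack layout, but the "Invalid read" 
plus "Invalid free()" pair valgrind reports is the signature this kind of 
corruption tends to leave.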

Can you put your binary somewhere so I can try it? How long does it take to 
cause the segfault?

Gabe

On Wed, Jun 10, 2020 at 5:07 PM Saileshwar, Gururaj <gurura...@gatech.edu> wrote:
Hi Gabe,

Below is the error report from Valgrind. From what I understand, there are 
errors in the decoder because of uninitialized values in the machine-instruction 
(ExtMachInst) input to decodeInst().

There is also an error reported in mem_state.cc, where the stack size is 
increased. But I could not find anything out of the ordinary when I used debug 
flags for that module to print debugging messages from that file. I do not know 
whether it is related.

The Valgrind leak report occupies a lot of space (several MB). Please let me 
know if you need it to understand what is happening; I can upload it and send 
you a link.
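
In case it helps with reproducing this, the Valgrind invocation was along these 
lines (paths elided, and the exact flags may differ slightly from what I used):

valgrind --leak-check=full --track-origins=yes --log-file=valgrind.out \
    ./build/X86/gem5.opt configs/example/se.py -c <benchmark> \
    --cpu-type=DerivO3CPU --caches

--track-origins=yes is what lets Valgrind trace the uninitialized-value errors 
back to where the values came from.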

Thanks,
Gururaj
---

==5823== ERROR SUMMARY: 1329 errors from 62 contexts (suppressed: 1066 from 55)
==5823==
==5823== 1 errors in context 1 of 62:
==5823== Invalid read of size 8
==5823==at 0x658DD9F: _Unwind_Resume (in 
/lib/x86_64-linux-gnu/libgcc_s.so.1)
==5823==by 0xBE3E94: X86ISA::Decoder::decodeInst(X86ISA::ExtMachInst) 
(decode-method.cc.inc:20696)
==5823==by 0xB4CCE3: X86ISA::Decoder::decode(X86ISA::ExtMachInst, unsigned 
long) (decoder.cc:686)
==5823==by 0xB4CFDA: X86ISA::Decoder::decode(X86ISA::PCState&) 
(decoder.cc:731)
==5823==by 0x72DEEE: DefaultFetch::fetch(bool&) 
(fetch_impl.hh:1297)
==5823==by 0x72F27A: DefaultFetch::tick() (fetch_impl.hh:930)
==5823==by 0x706A9A: FullO3CPU::tick() (cpu.cc:531)
==5823==by 0x8330F9: operator() (std_function.h:706)
==5823==by 0x8330F9: process (eventq.hh:1050)
==5823==by 0x8330F9: EventQueue::serviceOne() (eventq.cc:221)
==5823==by 0x856883: doSimLoop(EventQueue*) (simulate.cc:216)
==5823==by 0x8579DC: simulate(unsigned long) (simulate.cc:129)
==5823==by 0x14B5C1D: void pybind11::cpp_function::initialize(GlobalSimLoopExitEvent* (*&)(unsigned long), GlobalSimLoopExitEvent* (*)(unsigned long), pybind11::name const&, pybind11::scope const&, pybind11::sibling const&, pybind11::arg_v const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call&) (cast.h:1935)
==5823==by 0x7F20E0: pybind11::cpp_function::dispatcher(_object*, _object*, 
_object*) (pybind11.h:628)
==5823==  Address 0xc5 is not stack'd, malloc'd or (recently) free'd


==5823== 1 errors in context 2 of 62:
==5823== Invalid free() / delete / delete[] / realloc()
==5823==at 0x4C3123B: operator delete(void*) (in 
/usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==5823==by 0xBE3E8C: X86ISA::Decoder::decodeInst(X86ISA::ExtMachInst) 
(decode-method.cc.inc:20696)
==5823==by 0xB4CCE3: X86ISA::Decoder::decode(X86ISA::ExtMachInst, unsigned 
long) (decoder.cc:686)
==5823==by 0xB4CFDA: X86ISA::Decoder::decode(X86ISA::PCState&) 
(decoder.cc:731)
==5823==by 0x72DEEE: DefaultFetch::fetch(bool&) 
(fetch_impl.hh:1297)
==5823==by 0x72F27A: DefaultFetch::tick() (fetch_impl.hh:930)
==5823==by 0x706A9A: FullO3CPU::tick() (cpu.cc:531)
==5823==by 0x8330F9: operator() (std_function.h:706)
==5823==by 

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-11 Thread Matt Sinclair via gem5-users
I don't see anything amazingly amiss in your output, but the number of
times the open/etc. calls fail is interesting -- Kyle, do we see the same
thing? If not, you may need to update your apu_se.py to point to the
"correct" place to search for the libraries first.

Also, based on Kyle's reply, Dan, how long does it take you to boot up
square?  Certainly a slower machine might take longer, but it does seem
even slower than expected.  But if we're trying the same application, maybe
it will be easier to spot differences.

I would also recommend updating to the latest commit on the staging branch
-- I don't believe it should break anything with those patches.

Yes, looks like you are using the release version of ROCm -- no issues
there.

Matt



On Thu, Jun 11, 2020 at 9:38 AM Daniel Gerzhoy wrote:

> I am using the docker, yeah.
> It's running on our server cluster, which is a Xeon Gold 5117 @ 2.0-2.8 GHz;
> that might make up some of the difference, since the r5 3600 has a faster
> clock (3.6-4.2 GHz).
>
> I've hesitated to update my branch because in the Dockerfile it
> specifically checks this branch out and applies a patch, though the patch
> isn't very extensive.
> This was from a while back (November, maybe?), and I know you guys have been
> integrating things into the main branch (thanks!).
> I was thinking I would wait until it's fully merged into the mainline gem5
> branch and rebase onto that and try to merge my changes in.
>
> Last I checked, the GCN3 stuff is in the dev branch, not master, right?
>
> But if it will help maybe I should update to the head of this branch. Will
> I need to update the docker as well?
>
> As for the debug vs. release ROCm, I think I'm using the release version.
> This is what the dockerfile built:
>
> ARG rocm_ver=1.6.2
> RUN wget -qO- repo.radeon.com/rocm/archive/apt_${rocm_ver}.tar.bz2
>  \
> | tar -xjv \
> && cd apt_${rocm_ver}/pool/main/ \
> && dpkg -i h/hsakmt-roct-dev/* \
> && dpkg -i h/hsa-ext-rocr-dev/* \
> && dpkg -i h/hsa-rocr-dev/* \
> && dpkg -i r/rocm-utils/* \
> && dpkg -i h/hcc/* \
> && dpkg -i h/hip_base/* \
> && dpkg -i h/hip_hcc/* \
> && dpkg -i h/hip_samples/*
>
>
> I ran a benchmark that prints that it entered main() and returns
> immediately; this took 9 minutes.
> I've attached a debug trace with debug flags = "GPUDriver,SyscallVerbose"
> There are a lot of weird things going on: "syscall open: failed", "syscall
> brk: break point changed to [...]", and lots of ignored system calls.
>
> head of Stats for reference:
> -- Begin Simulation Statistics --
> sim_seconds                          0.096192  # Number of seconds simulated
> sim_ticks                         96192368500  # Number of ticks simulated
> final_tick                        96192368500  # Number of ticks from beginning of simulation (restored from checkpoints and never reset)
> sim_freq                        1000000000000  # Frequency of simulated ticks
> host_inst_rate                         175209  # Simulator instruction rate (inst/s)
> host_op_rate                           338409  # Simulator op (including micro ops) rate (op/s)
> host_tick_rate                      175362515  # Simulator tick rate (ticks/s)
> host_mem_usage                        1628608  # Number of bytes of host memory used
> host_seconds                           548.53  # Real time elapsed on the host
> sim_insts                            96108256  # Number of instructions simulated
> sim_ops                             185628785  # Number of ops (including micro ops) simulated
> system.voltage_domain.voltage               1  # Voltage in Volts
> system.clk_domain.clock                  1000  # Clock period in ticks
>
> Maybe something in the attached file explains it better than I can express.
>
> Many thanks for your help and hard work!
>
> Dan
>
>
>
>
>
> On Thu, Jun 11, 2020 at 3:32 AM Kyle Roarty  wrote:
>
>> Running through a few applications, it took me about 2.5 minutes or less
>> each time using docker to start executing the program on an r5 3600.
>>
>> I ran square, dynamic_shared, and MatrixTranspose (All from HIP) which
>> took about 1-1.5 mins.
>>
>> I ran conv_bench and rnn_bench from DeepBench which took just about 2
>> minutes.
>>
>> Because of that, it's possible the size of the app has an effect on setup
>> time, as the HIP apps are extremely small.
>>
>> Also, the commit Dan is checked out on is d0945dc (mem-ruby: add cache
>> hit/miss statistics for TCP and TCC), which isn't the most recent commit.
>> I don't believe that would account for such a large slowdown, but it
>> doesn't hurt to try the newest commit unless it breaks something.

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-11 Thread Daniel Gerzhoy via gem5-users
I am using the docker, yeah.
It's running on our server cluster, which is a Xeon Gold 5117 @ 2.0-2.8 GHz;
that might make up some of the difference, since the r5 3600 has a faster
clock (3.6-4.2 GHz).

I've hesitated to update my branch because in the Dockerfile it
specifically checks this branch out and applies a patch, though the patch
isn't very extensive.
This was from a while back (November, maybe?), and I know you guys have been
integrating things into the main branch (thanks!).
I was thinking I would wait until it's fully merged into the mainline gem5
branch and rebase onto that and try to merge my changes in.

Last I checked, the GCN3 stuff is in the dev branch, not master, right?

But if it will help maybe I should update to the head of this branch. Will
I need to update the docker as well?

As for the debug vs. release ROCm, I think I'm using the release version.
This is what the dockerfile built:

ARG rocm_ver=1.6.2
RUN wget -qO- repo.radeon.com/rocm/archive/apt_${rocm_ver}.tar.bz2 \
| tar -xjv \
&& cd apt_${rocm_ver}/pool/main/ \
&& dpkg -i h/hsakmt-roct-dev/* \
&& dpkg -i h/hsa-ext-rocr-dev/* \
&& dpkg -i h/hsa-rocr-dev/* \
&& dpkg -i r/rocm-utils/* \
&& dpkg -i h/hcc/* \
&& dpkg -i h/hip_base/* \
&& dpkg -i h/hip_hcc/* \
&& dpkg -i h/hip_samples/*


I ran a benchmark that prints that it entered main() and returns immediately;
this took 9 minutes.
I've attached a debug trace with debug flags = "GPUDriver,SyscallVerbose"
There are a lot of weird things going on: "syscall open: failed", "syscall
brk: break point changed to [...]", and lots of ignored system calls.
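
For reference, the invocation pattern was roughly the following (the build
directory and benchmark path are from my setup and may differ in yours):

./build/GCN3_X86/gem5.opt --debug-flags=GPUDriver,SyscallVerbose \
    configs/example/apu_se.py -c <benchmark>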

head of Stats for reference:
-- Begin Simulation Statistics --
sim_seconds                          0.096192  # Number of seconds simulated
sim_ticks                         96192368500  # Number of ticks simulated
final_tick                        96192368500  # Number of ticks from beginning of simulation (restored from checkpoints and never reset)
sim_freq                        1000000000000  # Frequency of simulated ticks
host_inst_rate                         175209  # Simulator instruction rate (inst/s)
host_op_rate                           338409  # Simulator op (including micro ops) rate (op/s)
host_tick_rate                      175362515  # Simulator tick rate (ticks/s)
host_mem_usage                        1628608  # Number of bytes of host memory used
host_seconds                           548.53  # Real time elapsed on the host
sim_insts                            96108256  # Number of instructions simulated
sim_ops                             185628785  # Number of ops (including micro ops) simulated
system.voltage_domain.voltage               1  # Voltage in Volts
system.clk_domain.clock                  1000  # Clock period in ticks

Maybe something in the attached file explains it better than I can express.

Many thanks for your help and hard work!

Dan





On Thu, Jun 11, 2020 at 3:32 AM Kyle Roarty  wrote:

> Running through a few applications, it took me about 2.5 minutes or less
> each time using docker to start executing the program on an r5 3600.
>
> I ran square, dynamic_shared, and MatrixTranspose (All from HIP) which
> took about 1-1.5 mins.
>
> I ran conv_bench and rnn_bench from DeepBench which took just about 2
> minutes.
>
> Because of that, it's possible the size of the app has an effect on setup
> time, as the HIP apps are extremely small.
>
> Also, the commit Dan is checked out on is d0945dc (mem-ruby: add cache
> hit/miss statistics for TCP and TCC), which isn't the most recent commit.
> I don't believe that would account for such a large slowdown, but it
> doesn't hurt to try the newest commit unless it breaks something.
>
> Kyle
> --
> *From:* mattdsincl...@gmail.com 
> *Sent:* Thursday, June 11, 2020 1:15 AM
> *To:* gem5 users mailing list 
> *Cc:* Daniel Gerzhoy; GAURAV JAIN <gja...@wisc.edu>; Kyle Roarty
> *Subject:* Re: [gem5-users] GCN3 GPU Simulation Start-Up Time
>
> Gaurav & Kyle, do you know if this is the case?
>
> Dan, I believe the short answer is yes, although 7-8 minutes seems a little
> long.  Are you running this in Kyle's Docker, or separately?  If in the
> Docker, that does increase the overhead somewhat, so running it directly on
> a system would likely reduce it.  Also, are you running with the release or
> debug version of the ROCm drivers?  Again, the debug version will likely
> add some time.
>
> Matt
>
> On Wed, Jun 10, 2020 at 2:00 PM Daniel Gerzhoy via gem5-users <
> gem5-users@gem5.org> wrote:
>
> I've been running simulations using the GCN3 branch:
>
> rocm_ver=1.6.2
> $git branch
>* (HEAD detached at d0945dc)

[gem5-users] Can't run multiprogram in se mode

2020-06-11 Thread Taiyu Zhou via gem5-users
Hi guys!
I am trying to run multiple programs in SE mode, but it doesn't work.

I run:
"
./build/X86/gem5.opt configs/example/se.py -c 
'/home/ubuntu/taiyu/gem5-master/tests/test-progs/hello/bin/x86/linux/hello;/home/ubuntu/taiyu/gem5-master/tests/test-progs/hello/bin/x86/linux/hello'
 --mem-size=8GB --cpu-type=DerivO3CPU --caches -n 2
"

gem5 will not exit after running the two hello programs. I tried to solve it 
following this:
https://www.mail-archive.com/gem5-users@gem5.org/msg17476.html


Then the two programs work!
But it doesn't work for more than two; for example:

"
./build/X86/gem5.opt configs/example/se.py -c 
'/home/ubuntu/taiyu/gem5-master/tests/test-progs/hello/bin/x86/linux/hello;/home/ubuntu/taiyu/gem5-master/tests/test-progs/hello/bin/x86/linux/hello;/home/ubuntu/taiyu/gem5-master/tests/test-progs/hello/bin/x86/linux/hello;/home/ubuntu/taiyu/gem5-master/tests/test-progs/hello/bin/x86/linux/hello'
 --mem-size=8GB --cpu-type=DerivO3CPU --caches -n 4

"

gem5 will not exit after running the four hello programs.
"
gem5 Simulator System.  http://gem5.org
gem5 is copyrighted software; use the --copyright option for details.

gem5 compiled Jun 11 2020 19:14:38
gem5 started Jun 11 2020 19:20:26
gem5 executing on ubuntu-16, pid 13634
command line: ./build/X86/gem5.opt configs/example/se.py -c 
'/home/ubuntu/taiyu/gem5-master/tests/test-progs/hello/bin/x86/linux/hello;/home/ubuntu/taiyu/gem5-master/tests/test-progs/hello/bin/x86/linux/hello;/home/ubuntu/taiyu/gem5-master/tests/test-progs/hello/bin/x86/linux/hello;/home/ubuntu/taiyu/gem5-master/tests/test-progs/hello/bin/x86/linux/hello'
 --mem-size=8GB --cpu-type=DerivO3CPU --caches -n 4

Global frequency set at 1000000000000 ticks per second
0: system.remote_gdb: listening for remote gdb on port 7000
0: system.remote_gdb: listening for remote gdb on port 7001
0: system.remote_gdb: listening for remote gdb on port 7002
0: system.remote_gdb: listening for remote gdb on port 7003
 REAL SIMULATION 
info: Entering event queue @ 0.  Starting simulation...
Hello world!
Hello world!
Hello world!
Hello world!

"___
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org
%(web_page_url)slistinfo%(cgiext)s/%(_internal_name)s

[gem5-users] Re: GCN3 GPU Simulation Start-Up Time

2020-06-11 Thread Matt Sinclair via gem5-users
Gaurav & Kyle, do you know if this is the case?

Dan, I believe the short answer is yes, although 7-8 minutes seems a little
long.  Are you running this in Kyle's Docker, or separately?  If in the
Docker, that does increase the overhead somewhat, so running it directly on
a system would likely reduce it.  Also, are you running with the release or
debug version of the ROCm drivers?  Again, the debug version will likely
add some time.

Matt

On Wed, Jun 10, 2020 at 2:00 PM Daniel Gerzhoy via gem5-users <
gem5-users@gem5.org> wrote:

> I've been running simulations using the GCN3 branch:
>
> rocm_ver=1.6.2
> $git branch
>* (HEAD detached at d0945dc)
>   agutierr/master-gcn3-staging
>
> And I've noticed that it takes roughly 7-8 minutes to get to main()
>
> I'm guessing that this is the simulator setting up drivers?
> Is that correct? Is there other stuff going on?
>
> *Has anyone found a way to speed this up?*
>
> I am trying to get some of the rodinia benchmarks from the HIP-Examples
> running, and debugging takes a long time as a result.
>
> I suspect that this is unavoidable but I won't know if I don't ask!
>
> Cheers,
>
> Dan Gerzhoy
___
gem5-users mailing list -- gem5-users@gem5.org
To unsubscribe send an email to gem5-users-le...@gem5.org