Hi Marcelo,

For future reference, in case someone else runs into this issue: another
possibility is that the branch predictor is the problem. It looks like it
could be predicting that the instruction is a branch. I'm not sure whether
that's specifically because of the compressed format, though. It's another
place for the next person to start digging.

Cheers,
Jason

On Fri, May 25, 2018 at 8:20 AM Marcelo Brandalero <mbrandal...@inf.ufrgs.br>
wrote:

> Hi Jason, Alec,
>
> Just to provide some feedback on this issue, it seems that the processor
> is mistakenly identifying (add reg, reg, reg) in compressed format as a
> branch instruction.
>
> I'm running a kernel that looks like this (result from
> *riscv64-unknown-elf-objdump -D*):
>
> 000000000001019a <myFunction>:
>   1019a:       06400793                li      a5,100
>   1019e:       4701                    li      a4,0
>   101a0:       4681                    li      a3,0
>   101a2:       4601                    li      a2,0
>   101a4:       0c800513                li      a0,200
>   101a8:       952a                    add     a0,a0,a0
>   101aa:       9632                    add     a2,a2,a2
>   101ac:       96b6                    add     a3,a3,a3
>   101ae:       973a                    add     a4,a4,a4
>   101b0:       952a                    add     a0,a0,a0
>   101b2:       9632                    add     a2,a2,a2
>   101b4:       96b6                    add     a3,a3,a3
>   101b6:       973a                    add     a4,a4,a4
>   (repeat the four instructions above until this:)
>   104b8:       952a                    add     a0,a0,a0
>   104ba:       9632                    add     a2,a2,a2
>   104bc:       96b6                    add     a3,a3,a3
>   104be:       973a                    add     a4,a4,a4
>   104c0:       952a                    add     a0,a0,a0
>   104c2:       2501                    sext.w  a0,a0
>   104c4:       9632                    add     a2,a2,a2
>   104c6:       2601                    sext.w  a2,a2
>   104c8:       96b6                    add     a3,a3,a3
>   104ca:       2681                    sext.w  a3,a3
>   104cc:       973a                    add     a4,a4,a4
>   104ce:       2701                    sext.w  a4,a4
>   104d0:       37fd                    addiw   a5,a5,-1
>   104d2:       cc079be3                bnez    a5,101a8 <myFunction+0xe>
>
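An inline note for anyone following up: the 16-bit encodings in the listing above can be checked by hand. The Python sketch below decodes the CR format of the C extension, where bits [15:12] are funct4, [11:7] are rd/rs1, [6:2] are rs2 and [1:0] are the opcode. `decode_cr` and `ABI` are hypothetical names made up for this illustration, not gem5 code. One thing the sketch makes visible is that C.ADD and C.JALR share funct4 and opcode and differ only in whether the rs2 field is zero. Whether that has anything to do with the behavior reported in this thread is not established here; it is just one more place to look.

```python
# ABI register names for x0..x31.
ABI = (["zero", "ra", "sp", "gp", "tp", "t0", "t1", "t2", "s0", "s1"]
       + [f"a{i}" for i in range(8)]
       + [f"s{i}" for i in range(2, 12)]
       + [f"t{i}" for i in range(3, 7)])


def decode_cr(halfword):
    """Decode a 16-bit CR-format encoding (illustrative helper, not gem5 code)."""
    op = halfword & 0x3
    funct4 = (halfword >> 12) & 0xF
    rd = (halfword >> 7) & 0x1F    # rd/rs1 field
    rs2 = (halfword >> 2) & 0x1F
    if op != 0b10 or funct4 not in (0b1000, 0b1001):
        return "not CR format"
    if funct4 == 0b1001:
        if rs2 != 0:
            # rd == 0 would be a HINT encoding; not distinguished here.
            return f"c.add {ABI[rd]},{ABI[rs2]}"
        if rd != 0:
            return f"c.jalr {ABI[rd]}"
        return "c.ebreak"
    # funct4 == 0b1000
    if rs2 != 0:
        # rd == 0 would be a HINT encoding; not distinguished here.
        return f"c.mv {ABI[rd]},{ABI[rs2]}"
    if rd != 0:
        return f"c.jr {ABI[rd]}"
    return "reserved"


if __name__ == "__main__":
    # The four compressed adds from the objdump listing.
    for enc in (0x952A, 0x9632, 0x96B6, 0x973A):
        print(hex(enc), decode_cr(enc))
```

Note that objdump prints `c.add rd,rs2` in its expanded form `add rd,rd,rs2`, which is why the listing shows three operands.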
> And what the Fetch stage looks like when fetching this code block is this:
>
> 4048968: system.cpu.fetch: [tid:0] Waking up from cache miss.
> 4048968: system.cpu.fetch: Running stage.
> 4048968: system.cpu.fetch: Attempting to fetch from [tid:0]
> 4048968: system.cpu.fetch: [tid:0]: Icache miss is complete.
> 4048968: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
> 4048968: system.cpu.fetch: [tid:0]: Instruction PC 0x101a8 (0) created [sn:8124].
> 4048968: system.cpu.fetch: [tid:0]: Instruction is: c_add a0, a0, a0
> 4048968: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
> *4048968: system.cpu.fetch: Branch detected with PC = (0x101a8=>0x101aa).(0=>1)*
> 4048968: system.cpu.fetch: [tid:0]: Done fetching, predicted branch instruction encountered.
> 4048968: system.cpu.fetch: [tid:0][sn:8124]: Sending instruction to decode from fetch queue. Fetch queue size: 1.
> 4049281: system.cpu.fetch: Running stage.
> 4049281: system.cpu.fetch: Attempting to fetch from [tid:0]
> 4049281: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
> 4049281: system.cpu.fetch: [tid:0]: Instruction PC 0x101aa (0) created [sn:8125].
> 4049281: system.cpu.fetch: [tid:0]: Instruction is: c_add a2, a2, a2
> 4049281: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
> *4049281: system.cpu.fetch: Branch detected with PC = (0x101aa=>0x101ac).(0=>1)*
> 4049281: system.cpu.fetch: [tid:0]: Done fetching, predicted branch instruction encountered.
> 4049281: system.cpu.fetch: [tid:0][sn:8125]: Sending instruction to decode from fetch queue. Fetch queue size: 1.
> 4049594: system.cpu.fetch: Running stage.
> 4049594: system.cpu.fetch: Attempting to fetch from [tid:0]
> 4049594: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
> 4049594: system.cpu.fetch: [tid:0]: Instruction PC 0x101ac (0) created [sn:8126].
> 4049594: system.cpu.fetch: [tid:0]: Instruction is: c_add a3, a3, a3
> 4049594: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
> *4049594: system.cpu.fetch: Branch detected with PC = (0x101ac=>0x101ae).(0=>1)*
> 4049594: system.cpu.fetch: [tid:0]: Done fetching, predicted branch instruction encountered.
> 4049594: system.cpu.fetch: [tid:0][sn:8126]: Sending instruction to decode from fetch queue. Fetch queue size: 1.
> 4049907: system.cpu.fetch: Running stage.
> 4049907: system.cpu.fetch: Attempting to fetch from [tid:0]
> 4049907: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
> 4049907: system.cpu.fetch: [tid:0]: Instruction PC 0x101ae (0) created [sn:8127].
> 4049907: system.cpu.fetch: [tid:0]: Instruction is: c_add a4, a4, a4
> 4049907: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
> *4049907: system.cpu.fetch: Branch detected with PC = (0x101ae=>0x101b0).(0=>1)*
> 4049907: system.cpu.fetch: [tid:0]: Done fetching, predicted branch instruction encountered.
> 4049907: system.cpu.fetch: [tid:0][sn:8127]: Sending instruction to decode from fetch queue. Fetch queue size: 1.
> 4050220: system.cpu.fetch: Running stage.
> 4050220: system.cpu.fetch: Attempting to fetch from [tid:0]
> 4050220: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
> 4050220: system.cpu.fetch: [tid:0]: Instruction PC 0x101b0 (0) created [sn:8128].
> 4050220: system.cpu.fetch: [tid:0]: Instruction is: c_add a0, a0, a0
> 4050220: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
> *4050220: system.cpu.fetch: Branch detected with PC = (0x101b0=>0x101b2).(0=>1)*
> 4050220: system.cpu.fetch: [tid:0]: Done fetching, predicted branch instruction encountered.
> 4050220: system.cpu.fetch: [tid:0][sn:8128]: Sending instruction to decode from fetch queue. Fetch queue size: 1.
> 4050533: system.cpu.fetch: Running stage.
> 4050533: system.cpu.fetch: Attempting to fetch from [tid:0]
> 4050533: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
> 4050533: system.cpu.fetch: [tid:0]: Instruction PC 0x101b2 (0) created [sn:8129].
> 4050533: system.cpu.fetch: [tid:0]: Instruction is: c_add a2, a2, a2
> 4050533: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
> *4050533: system.cpu.fetch: Branch detected with PC = (0x101b2=>0x101b4).(0=>1)*
> 4050533: system.cpu.fetch: [tid:0]: Done fetching, predicted branch instruction encountered.
> 4050533: system.cpu.fetch: [tid:0][sn:8129]: Sending instruction to decode from fetch queue. Fetch queue size: 1.
> 4050846: system.cpu.fetch: Running stage.
> 4050846: system.cpu.fetch: Attempting to fetch from [tid:0]
> 4050846: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
> 4050846: system.cpu.fetch: [tid:0]: Instruction PC 0x101b4 (0) created [sn:8130].
> 4050846: system.cpu.fetch: [tid:0]: Instruction is: c_add a3, a3, a3
> 4050846: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
> *4050846: system.cpu.fetch: Branch detected with PC = (0x101b4=>0x101b6).(0=>1)*
> 4050846: system.cpu.fetch: [tid:0]: Done fetching, predicted branch instruction encountered.
> 4050846: system.cpu.fetch: [tid:0][sn:8130]: Sending instruction to decode from fetch queue. Fetch queue size: 1.
>
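An inline note: on a longer trace, the pattern in the log above (a "Branch detected" line right after a non-branch "Instruction is:" line) can be found mechanically. The Python sketch below is a rough filter, not a gem5 tool. It assumes the log lines are unwrapped, one message per line, in the format quoted above. The `CONTROL_FLOW` set of mnemonic spellings and the function name `suspicious_branches` are assumptions made for this illustration, so adjust the set to whatever names actually appear in your trace.

```python
import re

# Patterns follow the Fetch debug output quoted in this thread.
INST_RE = re.compile(
    r"^(\d+): system\.cpu\.fetch: \[tid:\d+\]: Instruction is: (\S+)")
BRANCH_RE = re.compile(
    r"^(\d+): system\.cpu\.fetch: Branch detected with PC = "
    r"\((0x[0-9a-f]+)=>(0x[0-9a-f]+)\)")

# Assumed spellings of genuine control-flow mnemonics (hypothetical list).
CONTROL_FLOW = {
    "beq", "bne", "blt", "bge", "bltu", "bgeu", "jal", "jalr",
    "c_j", "c_jal", "c_jr", "c_jalr", "c_beqz", "c_bnez",
}


def suspicious_branches(lines):
    """Return (tick, pc, next_pc, mnemonic) for each 'Branch detected'
    message that follows an instruction not listed in CONTROL_FLOW."""
    hits = []
    last_mnemonic = None
    for line in lines:
        m = INST_RE.match(line)
        if m:
            last_mnemonic = m.group(2)
            continue
        m = BRANCH_RE.match(line)
        if m and last_mnemonic is not None and last_mnemonic not in CONTROL_FLOW:
            hits.append((int(m.group(1)), m.group(2), m.group(3), last_mnemonic))
    return hits


SAMPLE = """\
4048968: system.cpu.fetch: [tid:0]: Instruction is: c_add a0, a0, a0
4048968: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
4048968: system.cpu.fetch: Branch detected with PC = (0x101a8=>0x101aa).(0=>1)
""".splitlines()

if __name__ == "__main__":
    for tick, pc, next_pc, mnemonic in suspicious_branches(SAMPLE):
        print(tick, mnemonic, pc, "=>", next_pc)
```

In every flagged entry of the log above, the next PC is just the current PC plus 2, so the "branch" goes to the fall-through address, yet fetch still stops for that cycle.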
> Not sure if it's a decoder problem or what, but it seems to affect only
> instructions in the compressed format. It manifests itself in the
> statistics with the following abnormal behavior:
>
> system.cpu.fetch.rateDist::0                    13812     23.92%     23.92% # Number of instructions fetched each cycle (Total)
> *system.cpu.fetch.rateDist::1                    42910     74.32%     98.24% # Number of instructions fetched each cycle (Total)*
> system.cpu.fetch.rateDist::2                      624      1.08%     99.32% # Number of instructions fetched each cycle (Total)
> system.cpu.fetch.rateDist::3                      256      0.44%     99.77% # Number of instructions fetched each cycle (Total)
> system.cpu.fetch.rateDist::4                       59      0.10%     99.87% # Number of instructions fetched each cycle (Total)
> system.cpu.fetch.rateDist::5                       50      0.09%     99.95% # Number of instructions fetched each cycle (Total)
> system.cpu.fetch.rateDist::6                        5      0.01%     99.96% # Number of instructions fetched each cycle (Total)
> system.cpu.fetch.rateDist::7                        2      0.00%     99.97% # Number of instructions fetched each cycle (Total)
> system.cpu.fetch.rateDist::8                       19      0.03%    100.00% # Number of instructions fetched each cycle (Total)
> system.cpu.fetch.rateDist::overflows                0      0.00%    100.00% # Number of instructions fetched each cycle (Total)
>
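An inline note: the distribution above can be condensed into a single number, the mean number of instructions fetched per cycle. The short Python sketch below does the arithmetic with the counts copied from the rateDist lines. The names `RATE_DIST` and `mean_fetch_rate` are made up for this illustration. The top bucket of 8 suggests a fetch width of 8, but that is an inference from the stats, not something stated in the thread.

```python
# Cycle counts per "instructions fetched this cycle" bucket,
# copied from the system.cpu.fetch.rateDist lines above.
RATE_DIST = {0: 13812, 1: 42910, 2: 624, 3: 256, 4: 59,
             5: 50, 6: 5, 7: 2, 8: 19}


def mean_fetch_rate(dist):
    """Weighted mean: total instructions fetched / total cycles sampled."""
    cycles = sum(dist.values())
    instructions = sum(k * n for k, n in dist.items())
    return instructions / cycles


def share_at_most(dist, limit):
    """Fraction of cycles in which at most `limit` instructions were fetched."""
    cycles = sum(dist.values())
    return sum(n for k, n in dist.items() if k <= limit) / cycles


if __name__ == "__main__":
    print(sum(RATE_DIST.values()))                 # 57737 cycles sampled
    print(round(mean_fetch_rate(RATE_DIST), 2))    # 0.79
    print(round(100 * share_at_most(RATE_DIST, 1), 2))  # 98.24
```

So fetch delivers under one instruction per cycle on average, and about 98% of cycles fetch zero or one instruction, which matches the one-instruction-per-cycle pattern in the debug log.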
> I won't be digging further into this, since running without compressed
> format seems to fix the issue and is enough for my usage scenario. Just
> thought this information could be useful for someone.
>
> Cheers!
>
>
> On Thu, May 24, 2018 at 9:33 PM, Marcelo Brandalero <
> mbrandal...@inf.ufrgs.br> wrote:
>
>> Hi Jason, Alec,
>>
>> Thanks for the fast responses!
>>
>> I can say I managed to run a lot of benchmarks on O3 and none of them
>> crashed. I did notice, however, that their performance across O3 processors
>> of different widths showed only minor differences (on x86, the differences
>> were much more significant).
>>
>> I ran into this particular issue only today, though, so I can only say it
>> *seems to affect only binaries compiled with the C extension*.
>>
>> I'll run the tests suggested by both of you and reply here in case I find
>> anything interesting.
>>
>> Best regards,
>>
>>
>> On Thu, May 24, 2018 at 9:29 PM, Marcelo Brandalero <b.marc...@gmail.com>
>> wrote:
>>
>>> Hi Jason, Alec,
>>>
>>> Thanks for the fast responses!
>>>
>>> I can say I managed to run a lot of benchmarks on O3 and none of them
>>> crashed. I did notice, however, that their performance across O3 processors
>>> of different widths showed only minor differences (on x86, the differences
>>> were much more significant).
>>>
>>> I ran into this particular issue only today, though, so I can only say
>>> it *seems to affect only binaries compiled with the C extension*.
>>>
>>> I'll run the tests suggested and reply here in case I find anything
>>> interesting.
>>>
>>> Best regards,
>>>
>>> On Thu, May 24, 2018 at 9:06 PM, Alec Roelke <ar...@virginia.edu> wrote:
>>>
>>>> Hi Marcelo,
>>>>
>>>> Yes, gem5 does support the C extension (64-bit version only, though).
>>>> I don't know what could be causing your particular issue.  I'm not sure
>>>> advancePC is the issue, though, because all that essentially does is call
>>>> PCState::advance(), which is inherited unchanged from
>>>> GenericISA::UPCState.  Try doing as Jason suggests and run your simulation
>>>> with the Fetch debug flag enabled, and maybe that will shed some light on
>>>> the issue.
>>>>
>>>> -Alec
>>>>
>>>> On Thu, May 24, 2018 at 7:20 PM, Jason Lowe-Power <ja...@lowepower.com>
>>>> wrote:
>>>>
>>>>> Hi Marcelo,
>>>>>
>>>>> I'm not sure if RISC-V has been tested with the out-of-order CPU at
>>>>> all! I'm happy that at least it doesn't completely fail!
>>>>>
>>>>> For your problem of only fetching 1 instruction per cycle... I think
>>>>> it's going to take some digging. My first guess would be that it could be
>>>>> a problem with the advancePC() function that's implemented in the RISC-V
>>>>> decoder (in gem5/arch/riscv), but I don't really have any specific reason
>>>>> to think that :).
>>>>>
>>>>> You could try turning on some debug flags for the O3 CPU.
>>>>> Specifically, Fetch might be helpful.
>>>>>
>>>>> Cheers,
>>>>> Jason
>>>>>
>>>>> On Thu, May 24, 2018 at 4:06 PM Marcelo Brandalero <
>>>>> mbrandal...@inf.ufrgs.br> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I recently switched from gem5/x86 to gem5/RISCV due to some
>>>>>> advantages of this ISA.
>>>>>>
>>>>>> I'm getting some weird simulation results and I realized my compiler
>>>>>> was generating instructions for the compressed RISCV ISA extension (chapter
>>>>>> 12 in the user level ISA specification
>>>>>> <https://riscv.org/specifications/>). The weirdness disappears when
>>>>>> I use *--march* to remove these extensions.
>>>>>>
>>>>>> *So the question is: does gem5/RISCV support this ISA extension?* If
>>>>>> so, I can share the weird results (maybe I'm missing something), but
>>>>>> basically a wide-issue O3 processor fetches at most 1 instruction/cycle
>>>>>> when it should probably be fetching more.
>>>>>>
>>>>>> If it doesn't support it, then that's all OK. I just find it a bit weird
>>>>>> that the program executes normally with no warnings whatsoever.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> --
>>>>>> Marcelo Brandalero
>>>>>> PhD Candidate
>>>>>> Programa de Pós Graduação em Computação
>>>>>> Universidade Federal do Rio Grande do Sul
>>>>>> _______________________________________________
>>>>>> gem5-users mailing list
>>>>>> gem5-users@gem5.org
>>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Marcelo Brandalero
>>>
>>
>>
>>
>> --
>> Marcelo Brandalero
>> PhD Candidate
>> Programa de Pós Graduação em Computação
>> Universidade Federal do Rio Grande do Sul
>>
>
>
>
> --
> Marcelo Brandalero
> PhD Candidate
> Programa de Pós Graduação em Computação
> Universidade Federal do Rio Grande do Sul