Hi Jason, Alec,

Just to provide some feedback on this issue, it seems that the processor is
mistakenly identifying (add reg, reg, reg) in compressed format as a branch
instruction.

I'm running a kernel that looks like this (output of
*riscv64-unknown-elf-objdump -D*):

000000000001019a <myFunction>:
  1019a:       06400793                li      a5,100
  1019e:       4701                    li      a4,0
  101a0:       4681                    li      a3,0
  101a2:       4601                    li      a2,0
  101a4:       0c800513                li      a0,200
  101a8:       952a                    add     a0,a0,a0
  101aa:       9632                    add     a2,a2,a2
  101ac:       96b6                    add     a3,a3,a3
  101ae:       973a                    add     a4,a4,a4
  101b0:       952a                    add     a0,a0,a0
  101b2:       9632                    add     a2,a2,a2
  101b4:       96b6                    add     a3,a3,a3
  101b6:       973a                    add     a4,a4,a4
  (repeat the four instructions above until this:)
  104b8:       952a                    add     a0,a0,a0
  104ba:       9632                    add     a2,a2,a2
  104bc:       96b6                    add     a3,a3,a3
  104be:       973a                    add     a4,a4,a4
  104c0:       952a                    add     a0,a0,a0
  104c2:       2501                    sext.w  a0,a0
  104c4:       9632                    add     a2,a2,a2
  104c6:       2601                    sext.w  a2,a2
  104c8:       96b6                    add     a3,a3,a3
  104ca:       2681                    sext.w  a3,a3
  104cc:       973a                    add     a4,a4,a4
  104ce:       2701                    sext.w  a4,a4
  104d0:       37fd                    addiw   a5,a5,-1
  104d2:       cc079be3                bnez    a5,101a8 <myFunction+0xe>
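
For reference, the 16-bit encodings in the listing above can be checked by
hand. The sketch below is not gem5 code; it only follows the CR-format
encoding tables of the RISC-V C extension, and the function name classify_cr
is made up for illustration. It shows that c.add shares funct4 = 1001 with
c.jalr and c.ebreak and differs from them only in the rs2 field, which would
make that decode group one place to look. That is a guess, not a confirmed
cause.

```python
# Sketch only: classifies RVC quadrant-2 CR-format encodings according to the
# RISC-V C-extension encoding tables. This is NOT gem5's decoder.

def classify_cr(halfword):
    """Return the mnemonic for a 16-bit CR-format encoding, or None."""
    op = halfword & 0x3               # bits [1:0]: quadrant
    funct4 = (halfword >> 12) & 0xF   # bits [15:12]
    rd_rs1 = (halfword >> 7) & 0x1F   # bits [11:7]
    rs2 = (halfword >> 2) & 0x1F      # bits [6:2]
    if op != 0b10 or funct4 not in (0b1000, 0b1001):
        return None                   # not one of the CR-format encodings
    if funct4 == 0b1000:
        if rs2 == 0:
            return "c.jr" if rd_rs1 != 0 else None  # rs1 == 0 is reserved
        return "c.mv"                 # rd == 0 is a HINT, not separated here
    # funct4 == 0b1001: c.ebreak, c.jalr and c.add share this group
    if rs2 == 0:
        return "c.jalr" if rd_rs1 != 0 else "c.ebreak"
    return "c.add"                    # rd == 0 is a HINT, not separated here


# The four encodings from the objdump listing, plus c.jalr a0 for comparison.
for word in (0x952A, 0x9632, 0x96B6, 0x973A, 0x9502):
    print(hex(word), classify_cr(word))
```

The four encodings from the kernel all classify as c.add, while 0x9502 (the
same bits as 0x952a with rs2 cleared) classifies as c.jalr.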

And what the Fetch stage looks like when fetching this code block is this:

4048968: system.cpu.fetch: [tid:0] Waking up from cache miss.
4048968: system.cpu.fetch: Running stage.
4048968: system.cpu.fetch: Attempting to fetch from [tid:0]
4048968: system.cpu.fetch: [tid:0]: Icache miss is complete.
4048968: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
4048968: system.cpu.fetch: [tid:0]: Instruction PC 0x101a8 (0) created
[sn:8124].
4048968: system.cpu.fetch: [tid:0]: Instruction is: c_add a0, a0, a0
4048968: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
*4048968: system.cpu.fetch: Branch detected with PC =
(0x101a8=>0x101aa).(0=>1)*
4048968: system.cpu.fetch: [tid:0]: Done fetching, predicted branch
instruction encountered.
4048968: system.cpu.fetch: [tid:0][sn:8124]: Sending instruction to decode
from fetch queue. Fetch queue size: 1.
4049281: system.cpu.fetch: Running stage.
4049281: system.cpu.fetch: Attempting to fetch from [tid:0]
4049281: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
4049281: system.cpu.fetch: [tid:0]: Instruction PC 0x101aa (0) created
[sn:8125].
4049281: system.cpu.fetch: [tid:0]: Instruction is: c_add a2, a2, a2
4049281: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
*4049281: system.cpu.fetch: Branch detected with PC =
(0x101aa=>0x101ac).(0=>1)*
4049281: system.cpu.fetch: [tid:0]: Done fetching, predicted branch
instruction encountered.
4049281: system.cpu.fetch: [tid:0][sn:8125]: Sending instruction to decode
from fetch queue. Fetch queue size: 1.
4049594: system.cpu.fetch: Running stage.
4049594: system.cpu.fetch: Attempting to fetch from [tid:0]
4049594: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
4049594: system.cpu.fetch: [tid:0]: Instruction PC 0x101ac (0) created
[sn:8126].
4049594: system.cpu.fetch: [tid:0]: Instruction is: c_add a3, a3, a3
4049594: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
*4049594: system.cpu.fetch: Branch detected with PC =
(0x101ac=>0x101ae).(0=>1)*
4049594: system.cpu.fetch: [tid:0]: Done fetching, predicted branch
instruction encountered.
4049594: system.cpu.fetch: [tid:0][sn:8126]: Sending instruction to decode
from fetch queue. Fetch queue size: 1.
4049907: system.cpu.fetch: Running stage.
4049907: system.cpu.fetch: Attempting to fetch from [tid:0]
4049907: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
4049907: system.cpu.fetch: [tid:0]: Instruction PC 0x101ae (0) created
[sn:8127].
4049907: system.cpu.fetch: [tid:0]: Instruction is: c_add a4, a4, a4
4049907: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
*4049907: system.cpu.fetch: Branch detected with PC =
(0x101ae=>0x101b0).(0=>1)*
4049907: system.cpu.fetch: [tid:0]: Done fetching, predicted branch
instruction encountered.
4049907: system.cpu.fetch: [tid:0][sn:8127]: Sending instruction to decode
from fetch queue. Fetch queue size: 1.
4050220: system.cpu.fetch: Running stage.
4050220: system.cpu.fetch: Attempting to fetch from [tid:0]
4050220: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
4050220: system.cpu.fetch: [tid:0]: Instruction PC 0x101b0 (0) created
[sn:8128].
4050220: system.cpu.fetch: [tid:0]: Instruction is: c_add a0, a0, a0
4050220: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
*4050220: system.cpu.fetch: Branch detected with PC =
(0x101b0=>0x101b2).(0=>1)*
4050220: system.cpu.fetch: [tid:0]: Done fetching, predicted branch
instruction encountered.
4050220: system.cpu.fetch: [tid:0][sn:8128]: Sending instruction to decode
from fetch queue. Fetch queue size: 1.
4050533: system.cpu.fetch: Running stage.
4050533: system.cpu.fetch: Attempting to fetch from [tid:0]
4050533: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
4050533: system.cpu.fetch: [tid:0]: Instruction PC 0x101b2 (0) created
[sn:8129].
4050533: system.cpu.fetch: [tid:0]: Instruction is: c_add a2, a2, a2
4050533: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
*4050533: system.cpu.fetch: Branch detected with PC =
(0x101b2=>0x101b4).(0=>1)*
4050533: system.cpu.fetch: [tid:0]: Done fetching, predicted branch
instruction encountered.
4050533: system.cpu.fetch: [tid:0][sn:8129]: Sending instruction to decode
from fetch queue. Fetch queue size: 1.
4050846: system.cpu.fetch: Running stage.
4050846: system.cpu.fetch: Attempting to fetch from [tid:0]
4050846: system.cpu.fetch: [tid:0]: Adding instructions to queue to decode.
4050846: system.cpu.fetch: [tid:0]: Instruction PC 0x101b4 (0) created
[sn:8130].
4050846: system.cpu.fetch: [tid:0]: Instruction is: c_add a3, a3, a3
4050846: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).
*4050846: system.cpu.fetch: Branch detected with PC =
(0x101b4=>0x101b6).(0=>1)*
4050846: system.cpu.fetch: [tid:0]: Done fetching, predicted branch
instruction encountered.
4050846: system.cpu.fetch: [tid:0][sn:8130]: Sending instruction to decode
from fetch queue. Fetch queue size: 1.
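
To see which mnemonics get flagged across a whole trace rather than by eye,
a small script can tally them. This is a rough sketch: it assumes each trace
message sits on a single line (the lines above are wrapped by the mail
client) and that the messages look exactly like the ones shown here. The
function name count_predicted_branches is hypothetical.

```python
import re
from collections import Counter

# Patterns taken from the Fetch trace messages quoted above (assumed format).
INSTR_RE = re.compile(r"Instruction is: (\S+)")
BRANCH_RE = re.compile(r"Branch detected with PC")


def count_predicted_branches(lines):
    """Return (fetched, flagged) Counters keyed by instruction mnemonic.

    An instruction counts as flagged when a "Branch detected" message
    appears after its "Instruction is:" message and before the next one.
    """
    fetched, flagged = Counter(), Counter()
    last = None
    for line in lines:
        match = INSTR_RE.search(line)
        if match:
            last = match.group(1)
            fetched[last] += 1
        elif last is not None and BRANCH_RE.search(line):
            flagged[last] += 1
            last = None
    return fetched, flagged


# Two entries from the trace above, with the mail wrapping undone.
sample = [
    "4048968: system.cpu.fetch: [tid:0]: Instruction is: c_add a0, a0, a0",
    "4048968: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).",
    "4048968: system.cpu.fetch: Branch detected with PC = (0x101a8=>0x101aa).(0=>1)",
    "4049281: system.cpu.fetch: [tid:0]: Instruction is: c_add a2, a2, a2",
    "4049281: system.cpu.fetch: [tid:0]: Fetch queue entry created (1/256).",
    "4049281: system.cpu.fetch: Branch detected with PC = (0x101aa=>0x101ac).(0=>1)",
]
fetched, flagged = count_predicted_branches(sample)
for name in fetched:
    print(name, fetched[name], flagged[name])
```

On a full trace, a mnemonic whose flagged count equals its fetched count and
which is not a control-flow instruction would be a candidate for the same
problem.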

Not sure if it's a decoder problem or what, but it seems to affect only
instructions in the compressed format. It manifests itself in the
statistics with the following abnormal behavior:

system.cpu.fetch.rateDist::0                    13812     23.92%     23.92% # Number of instructions fetched each cycle (Total)
system.cpu.fetch.rateDist::1                    42910     74.32%     98.24% # Number of instructions fetched each cycle (Total)
system.cpu.fetch.rateDist::2                      624      1.08%     99.32% # Number of instructions fetched each cycle (Total)
system.cpu.fetch.rateDist::3                      256      0.44%     99.77% # Number of instructions fetched each cycle (Total)
system.cpu.fetch.rateDist::4                       59      0.10%     99.87% # Number of instructions fetched each cycle (Total)
system.cpu.fetch.rateDist::5                       50      0.09%     99.95% # Number of instructions fetched each cycle (Total)
system.cpu.fetch.rateDist::6                        5      0.01%     99.96% # Number of instructions fetched each cycle (Total)
system.cpu.fetch.rateDist::7                        2      0.00%     99.97% # Number of instructions fetched each cycle (Total)
system.cpu.fetch.rateDist::8                       19      0.03%    100.00% # Number of instructions fetched each cycle (Total)
system.cpu.fetch.rateDist::overflows                0      0.00%    100.00% # Number of instructions fetched each cycle (Total)
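
To put a single number on that distribution, the mean fetch rate can be
computed from the counts above. This is plain arithmetic on the rateDist
values as reported, assuming each sample corresponds to one fetch cycle:

```python
# Counts copied from the system.cpu.fetch.rateDist lines above:
# key = instructions fetched in a cycle, value = number of cycles.
rate_dist = {0: 13812, 1: 42910, 2: 624, 3: 256, 4: 59, 5: 50, 6: 5, 7: 2, 8: 19}

cycles = sum(rate_dist.values())
instructions = sum(n * count for n, count in rate_dist.items())
mean_rate = instructions / cycles
share_at_most_one = (rate_dist[0] + rate_dist[1]) / cycles

print(f"cycles sampled:        {cycles}")
print(f"instructions fetched:  {instructions}")
print(f"mean fetch rate:       {mean_rate:.3f} instructions/cycle")
print(f"cycles fetching <= 1:  {share_at_most_one:.2%}")
```

That works out to 45608 instructions over 57737 cycles, or roughly 0.79
instructions per cycle, even though the distribution has buckets up to 8
instructions per cycle.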

I won't be digging further into this, since running without compressed
format seems to fix the issue and is enough for my usage scenario. Just
thought this information could be useful for someone.

Cheers!


On Thu, May 24, 2018 at 9:33 PM, Marcelo Brandalero <
mbrandal...@inf.ufrgs.br> wrote:

> Hi Jason, Alec,
>
> Thanks for the fast responses!
>
> I can say I managed to run a lot of benchmarks on O3 and none of them
> crashed. I did notice, however, that their performance across O3
> processors of different widths differed only slightly (on x86, the
> differences were much more significant).
>
> I ran into this particular issue only today, though, so I can only say it
> *seems to affect only binaries compiled with C extensions*.
>
> I'll run the tests suggested by both of you and reply here in case I find
> anything interesting.
>
> Best regards,
>
>
> On Thu, May 24, 2018 at 9:29 PM, Marcelo Brandalero <b.marc...@gmail.com>
> wrote:
>
>> Hi Jason, Alec,
>>
>> Thanks for the fast responses!
>>
>> I can say I managed to run a lot of benchmarks on O3 and none of them
>> crashed. I did notice, however, that their performance across O3
>> processors of different widths differed only slightly (on x86, the
>> differences were much more significant).
>>
>> I ran into this particular issue only today, though, so I can only say it
>> *seems to affect only binaries compiled with C extensions*.
>>
>> I'll run the tests suggested and reply here in case I find anything
>> interesting.
>>
>> Best regards,
>>
>> On Thu, May 24, 2018 at 9:06 PM, Alec Roelke <ar...@virginia.edu> wrote:
>>
>>> Hi Marcelo,
>>>
>>> Yes, gem5 does support the C extension (64-bit version only, though).  I
>>> don't know what could be causing your particular issue.  I'm not sure
>>> advancePC is the issue, though, because all that essentially does is call
>>> PCState::advance(), which is inherited unchanged from
>>> GenericISA::UPCState.  Try doing as Jason suggests and run your simulation
>>> with the Fetch debug flag enabled, and maybe that will shed some light on
>>> the issue.
>>>
>>> -Alec
>>>
>>> On Thu, May 24, 2018 at 7:20 PM, Jason Lowe-Power <ja...@lowepower.com>
>>> wrote:
>>>
>>>> Hi Marcelo,
>>>>
>>>> I'm not sure if RISC-V has been tested with the out of order CPU at
>>>> all! I'm happy that at least it doesn't completely fail!
>>>>
>>>> For your problem of only fetching 1 instruction per cycle... I think
>>>> it's going to take some digging. My first guess would be that it could be a
>>>> problem with the advancePC() function that's implemented in the RISC-V
>>>> decoder (in gem5/arch/riscv), but I don't really have any specific reason
>>>> to think that :).
>>>>
>>>> You could try turning on some debug flags for the O3 CPU. Specifically,
>>>> Fetch might be helpful.
>>>>
>>>> Cheers,
>>>> Jason
>>>>
>>>> On Thu, May 24, 2018 at 4:06 PM Marcelo Brandalero <
>>>> mbrandal...@inf.ufrgs.br> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I recently switched from gem5/x86 to gem5/RISCV due to some advantages
>>>>> of this ISA.
>>>>>
>>>>> I'm getting some weird simulation results and I realized my compiler
>>>>> was generating instructions for the compressed RISCV ISA extension
>>>>> (chapter 12 of the user-level ISA specification
>>>>> <https://riscv.org/specifications/>). The weirdness disappears when I
>>>>> use *--march* to remove these extensions.
>>>>>
>>>>> *So the question is: does gem5/RISCV support this ISA extension?* If
>>>>> so, I can share the weird results (maybe I'm missing something), but
>>>>> basically a wide-issue O3 processor fetches at most 1 instruction/cycle
>>>>> when it should probably be fetching more.
>>>>>
>>>>> If it doesn't support it, then it's all OK; I just find it a bit weird
>>>>> that the program executes normally with no warnings whatsoever.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> --
>>>>> Marcelo Brandalero
>>>>> PhD Candidate
>>>>> Programa de Pós Graduação em Computação
>>>>> Universidade Federal do Rio Grande do Sul
>>>>> _______________________________________________
>>>>> gem5-users mailing list
>>>>> gem5-users@gem5.org
>>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>>
>>>>
>>>
>>>
>>
>>
>>
>> --
>> Marcelo Brandalero
>>
>
>
>
> --
> Marcelo Brandalero
> PhD Candidate
> Programa de Pós Graduação em Computação
> Universidade Federal do Rio Grande do Sul
>



-- 
Marcelo Brandalero
PhD Candidate
Programa de Pós Graduação em Computação
Universidade Federal do Rio Grande do Sul
