Re: Performance improvements of Marocchino implementation

el 01 Sun, 09 Mar 2025 10:02:54 -0700

Hello,

Hopefully this makes its way onto the mailing list; my previous email didn't.

Stafford's previous email basically covered what I did last summer. I've been dealing with some health issues and haven't been able to consistently document my progress; really sorry about the lack of documentation.

I left off while trying to figure out some (perhaps superficial) differences between the cycle counts measured when running a benchmark under LiteX and under FuseSoC, two different 'build systems' for the HDL design. This doesn't directly address the discrepancy between marocchino and mor1kx, but it was a step along the way.

The build systems bundle the OpenRISC core with other necessary hardware (e.g. simulated memory and peripherals) and build either a binary for simulation on your computer or something which can be put on an FPGA.

When running binaries from the Embench benchmarking suite on the same processor core and simulation engine, changing only the build system, some tests show substantially different cycle counts.

Some initial data I gathered is attached. It seems that some instructions take substantially more cycles to execute on LiteX.
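As a rough sketch of how the two attached per-instruction listings could be compared: the script below parses `mnemonic: value` lines and prints the ratio between the second listing and the first. The excerpts are copied from the attachments; the function name `parse_cpi`, the variable names, and the reading of the values as average cycles per instruction type are assumptions for illustration, not part of the original analysis scripts.

```python
# Excerpts copied from the two attached per-instruction listings.
# The names FIRST_LISTING / SECOND_LISTING are illustrative only.
FIRST_LISTING = """\
jal: 1.000000
l.bf: 1.589264
l.movhi: 1.004103
l.nop: 1.000000
l.sb: 2.550000
nettle update pc: 45f0 -> 45f8
56 -> 189
"""

SECOND_LISTING = """\
jal: 3.624573
l.bf: 1.137044
l.movhi: 5.815331
l.nop: 5.398569
l.sb: 2.542857
nettle_sha256.update pc: 4000352c: -> 40003530:
59 -> 554
"""


def parse_cpi(text):
    """Parse 'mnemonic: value' lines into a dict; ignore everything else."""
    table = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # no colon, e.g. the '56 -> 189' lines
        try:
            table[name.strip()] = float(value)
        except ValueError:
            continue  # not a number, e.g. the critical-section PC ranges
    return table


first = parse_cpi(FIRST_LISTING)
second = parse_cpi(SECOND_LISTING)

# Ratio of second to first for every mnemonic present in both listings.
ratios = {op: second[op] / first[op] for op in first if op in second}

# Largest slowdowns first.
for op, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{op:8s} {first[op]:9.6f} {second[op]:9.6f}  x{ratio:.2f}")
```

Sorting by ratio puts the instructions with the largest discrepancy at the top, which may help decide where to look first in the traces.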

I'm also not 100% sure that the measured cycle counts are completely accurate, as the debug / trace parts of LiteX and FuseSoC are somewhat different.

Another minor thing that I wanted to address was some inefficiency in running LiteX simulations. Because of the way the Embench testing script for LiteX works (see https://github.com/hhe07/litex-esp/blob/main/sim.py -- from what I remember, this is stuff that you can copy into your Embench install folder to enable compatibility), I think the CPU and some of the supporting software are rebuilt every time a different benchmark is run, which wastes a lot of time.

As for where this fits into the larger issue of the performance discrepancy between mor1kx and marocchino: in my experience, I spent a lot of time figuring out the tools and working out whether what I wanted to do was already a feature of a tool or something I needed to work out myself. So I'd recommend getting to know the tooling first, perhaps by doing some practice tasks with it. YMMV, though.

I know my poor documentation hasn't made this problem any easier, so please email me if you're unsure about anything that I did. I'll try to reply ASAP.


As for the attached files:

- profile.ods includes analysis on cycle counts per instruction for one test, I think nettle_sha256. This is for the mor1kx CPU.
- results.ods includes cycle counts for all Embench tests run on both FuseSoC and LiteX, and calculated percent differences between the two.
- nettle-mor1kx-{fusesoc, litex}-trace-prof include the outputs of cycle counts from the analysis scripts I wrote (basically same as profile.ods), as well as some additional information on the PCs of the start/end of critical sections in the code, and how many cycles they took to execute.
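For the critical-section figures in the trace-prof outputs, a minimal sketch of the percent-difference calculation might look like the following. It assumes that a line such as `56 -> 189` gives the cycle counter at the start and end of a section, and that the percent difference is taken relative to the first listing; neither is confirmed by the attachments, and results.ods may use a different formula. The helper names are hypothetical.

```python
def elapsed_cycles(start, end):
    """Cycles spent in a section, assuming 'start -> end' are counter values."""
    return end - start


def percent_difference(baseline, other):
    """Percent change of `other` relative to `baseline` (an assumed convention)."""
    return (other - baseline) / baseline * 100.0


# (start, end) pairs copied from the first and second attached listings.
SECTIONS = {
    "update": ((56, 189), (59, 554)),
    "write digest": ((191, 10521), (558, 12034)),
}

results = {}
for name, (first, second) in SECTIONS.items():
    a = elapsed_cycles(*first)
    b = elapsed_cycles(*second)
    results[name] = (a, b, percent_difference(a, b))
    print(f"{name:13s} {a:6d} {b:6d} {results[name][2]:+8.2f}%")
```

Under these assumptions the short `update` section differs far more, in relative terms, than the longer `write digest` section, which would be consistent with a fixed per-section overhead rather than a uniform slowdown; that reading is a guess and would need checking against the traces.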



~ Leo

On 02/03/2025 21:11, Idzwan Nizam Jamal Abdul Nasir wrote:

Hi,

I am interested in the OpenRISC Benchmarking and Performance improvements task
listed as one of the project ideas in Google Summer of Code. I am unable to
participate in GSoC, but I would like to contribute to the task gradually as I
acquire skills in digital logic and computer architecture.

Is the task still open? I would be glad if you could point me in the right
direction, such as documentation I should read or tools I should be familiar
with. Any guidance is welcome and greatly appreciated. Thank you.

profile.ods
Description: application/vnd.oasis.opendocument.spreadsheet

results.ods
Description: application/vnd.oasis.opendocument.spreadsheet

jal: 1.000000
jump: 1.000000
l.add: 1.066557
l.addi: 1.381047
l.and: 1.015590
l.andi: 1.117647
l.bf: 1.589264
l.bnf: 1.000000
l.jr: 1.750131
l.lbz: 1.000000
l.lhz: 1.000000
l.lwz: 1.148739
l.movhi: 1.004103
l.nop: 1.000000
l.or: 1.003138
l.ori: 1.018006
l.sb: 2.550000
l.sfgtu: 1.593730
l.sll: 1.006010
l.srl: 1.067661
l.sub: 1.000000
l.sw: 1.384018
l.xor: 1.012799
l.xori: 1.000000

nettle update pc: 45f0 -> 45f8
56 -> 189

nettle write digest: 4600 -> 4608

191 -> 10521

jal: 3.624573
jump: 1.040984
l.add: 2.005511
l.addi: 1.331264
l.and: 1.948052
l.andi: 2.000000
l.bf: 1.137044
l.bnf: 2.000000
l.jr: 3.000789
l.lbz: 1.041667
l.lhz: 1.625000
l.lwz: 2.438729
l.movhi: 5.815331
l.nop: 5.398569
l.or: 2.064235
l.ori: 2.064082
l.sb: 2.542857
l.sfgtu: 1.212507
l.sll: 2.140017
l.srl: 2.075332
l.sub: 2.000000
l.sw: 3.226664
l.xor: 2.095170
l.xori: 3.500000

nettle_sha256.update pc: 4000352c: -> 40003530:
59 -> 554

nettle_sha256.write digest: 4000353c: -> 40003540:

558 -> 12034
