Hello all,

This is not strictly a gem5 question; it has more to do with micro-architecture 
design itself. I thought it might still be appropriate to ask here, since others 
may have looked into the same thing before.


I am trying to understand the execution latency of ARM NEON instructions. For 
example, the VADD instruction (as per the gem5 config file) has a 4-cycle 
execution latency. How are these 4 cycles spent during actual execution? 
Specifically, how many of them go into pre-processing the data before the 
addition itself takes place? And if there is no pre-processing, why is this a 
4-cycle operation when ordinary integer ADDs are single-cycle?


Related to this, NEON supports element widths from 8 bits to 64 bits. Are the 
wider operations (e.g. 64-bit) performed internally as a series of 8-bit 
operations?
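
To make the element-width question concrete, here is a small Python sketch of 
lane-wise addition on a 64-bit value. It only models the architectural result 
(each lane wraps on its own, and no carry crosses a lane boundary); it is not a 
claim about how the hardware is built. The function name lane_add and its 
parameters are made up for illustration.

```python
def lane_add(a, b, lane_bits, reg_bits=64):
    """Add two reg_bits-wide integers lane by lane, wrapping within each lane."""
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, reg_bits, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        # Keep only the low lane_bits bits, so the carry out of a lane is dropped.
        result |= (lane_sum & mask) << shift
    return result


# Same inputs, different element widths.
print(hex(lane_add(0xFF, 0x01, 8)))   # 8-bit lanes: carry out of byte 0 is lost
print(hex(lane_add(0xFF, 0x01, 64)))  # one 64-bit lane: carry reaches byte 1
```

With 8-bit lanes the carry out of the lowest byte is discarded, while with a 
single 64-bit lane it propagates into the next byte. So a 64-bit add built from 
8-bit pieces would need some way to chain the carries between them, which is 
part of what I am asking about.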

If anyone can point to open-source material that describes the NEON 
pipeline/execution unit in detail, that would be great as well. 
Thank you!


Best,

Gokul Subramanian Ravi,
Graduate Student,
ECE Dept.,
University of Wisconsin-Madison
_______________________________________________
gem5-users mailing list
[email protected]
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
