Hi all, follow-up question. In ARMv8, the LDP instruction:

    LDP <Qt1>, <Qt2>, [<Xn|SP>{, #<imm>}]

loads a pair of 128-bit values (256 bits in total) from memory into two Q registers (128-bit vector registers).

When I debug gem5 to see how this LDP instruction operates (in AtomicCPU, for now), I see that it is broken down into 3 micro-ops: 2 loads + 1 register writeback (because of the post-increment I'm using in the instruction).

However, I don't get why gem5 issues two memory loads if the 256 bits that feed the registers are contiguous in memory. Couldn't memory provide all 256 bits and feed both destination registers at once?

Some possible reasons I thought of:

- The memory port only allows 128-bit loads.
  Although this could be the case, reading a full cache line (64 B) would sound more reasonable.

- We have only one write port.
  We need two load micro-ops because we can write only one destination register at a time (and we have two destination registers). But in this case, why issue a new memory load in the second uop if the previous load has already brought in the data (assuming memory returns 64 B / 512 bits)? Why not keep the loaded data within the "macro-op context" (if such a thing exists)? Is it simply relying on the cache?

Any clarification on the reason this operation works the way it does (or on macro memory operations in ARM as a whole) is very welcome!

Thank you,
Pedro.
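
To make the question concrete, the two alternatives can be sketched as a toy Python model. This is not gem5 code and does not claim to reflect how gem5 implements LDP: the function names, the register dictionary, and the trace format are all hypothetical, and the order of the three steps is only an assumption based on the "2 loads + 1 writeback" breakdown described above.

```python
# Toy model only, NOT gem5 code. All names below are hypothetical.
Q_BYTES = 16  # one Q register holds 128 bits


def ldp_post_index_microops(mem, regs, qt1, qt2, xn, imm):
    """Post-indexed LDP of two Q registers, modeled as three separate steps."""
    base = regs[xn]
    trace = []
    # Step 1: first 128-bit load from [base]
    regs[qt1] = bytes(mem[base:base + Q_BYTES])
    trace.append(("load", base, Q_BYTES))
    # Step 2: second 128-bit load from [base + 16]
    regs[qt2] = bytes(mem[base + Q_BYTES:base + 2 * Q_BYTES])
    trace.append(("load", base + Q_BYTES, Q_BYTES))
    # Step 3: base register writeback (post-increment)
    regs[xn] = base + imm
    trace.append(("writeback", xn, imm))
    return trace


def ldp_post_index_single_access(mem, regs, qt1, qt2, xn, imm):
    """The alternative asked about: one 256-bit access feeding both registers."""
    base = regs[xn]
    data = bytes(mem[base:base + 2 * Q_BYTES])
    regs[qt1], regs[qt2] = data[:Q_BYTES], data[Q_BYTES:]
    regs[xn] = base + imm
    return [("load", base, 2 * Q_BYTES), ("writeback", xn, imm)]


# Same initial state for both variants.
mem = bytearray(range(64))
regs_a = {"x0": 0, "q0": b"", "q1": b""}
regs_b = dict(regs_a)

trace_a = ldp_post_index_microops(mem, regs_a, "q0", "q1", "x0", 32)
trace_b = ldp_post_index_single_access(mem, regs_b, "q0", "q1", "x0", 32)
```

Both variants leave the registers in the same final state; they differ only in the number and size of memory accesses in the trace, which is the part the question is about.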