Answers:
1. Yes, I believe so. However, I have never personally tried using the O3
model with the GPU. Matt P has, I believe, so he may have better feedback
there.
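For reference, a GPUSE run with an O3 CPU might be launched along these lines. The build target and flag names here are assumptions on my part (they vary across gem5 versions), so double-check them against your tree:

```shell
# Hypothetical invocation -- verify the flags with --help for your gem5
# version.  The build target may be GCN3_X86 or VEGA_X86 depending on
# the release.
build/VEGA_X86/gem5.opt configs/example/apu_se.py \
    --cpu-type=O3CPU -n 3 \
    -c path/to/your/gpu_binary
```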
2. I have not followed the chain of events all the way through here, but I
*believe* that the builtin you highlighted is used at the compiler level by
HIPCC/LLVM to generate the appropriate assembly for a given AMD GPU. In
this case (gfx900), I believe the builtin maps one-to-one to an s_sleep
assembly instruction (possibly with a v_mov-type instruction before it to
set the register to the appropriate sleep value). I am not aware of the
s_sleep builtin requiring OS calls (or emulation). But what you have
described is, more generally, the issue with
SE mode (CPU, GPU, etc.) -- because SE mode does not model OS calls, the
fidelity of anything involving the OS will be less. Perhaps a trite way to
answer this is: if the fidelity of the OS calls is important for the
applications you are studying, then I strongly recommend using FS mode.
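For what it's worth, a HeteroSync-style sleep mutex would use the builtin roughly like this. This is an untested device-code sketch (the function name and sleep value are illustrative, not HeteroSync's actual code), and it needs hipcc/ROCm to compile:

```cpp
// HIP device-code sketch (requires hipcc/ROCm; illustrative only).
__device__ void sleep_mutex_lock(unsigned int *lock) {
  // Spin until the lock is acquired, backing off with s_sleep.
  while (atomicCAS(lock, 0u, 1u) != 0u) {
    // hipcc/LLVM lowers this builtin to an s_sleep instruction,
    // with no OS involvement.
    __builtin_amdgcn_s_sleep(8);
  }
}
```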
Hope this helps,
Matt S.
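P.S. To make the point about where synchronization completes a bit more concrete, here is a toy Python sketch of the routing decision discussed below in the thread: by default an atomic completes at the directory, while a GLC-scoped atomic could complete at the GPU's LLC. The class and level names are invented for illustration; this is not gem5 code.

```python
# Toy sketch of where a GPU atomic completes in a VIPER-like hierarchy.
# Class and level names are invented for illustration; not gem5 code.
from dataclasses import dataclass

@dataclass
class AtomicRequest:
    address: int
    glc: bool  # True if the GLC bit was set on the instruction

def completion_level(req: AtomicRequest) -> str:
    """Pick the hierarchy level where the atomic is performed.

    By default, VIPER assumes all synchronization completes at the
    directory (the first level shared by both the CPU and GPU).  If the
    GLC bit is set, the atomic could instead complete at the GPU's LLC,
    provided the protocol can send the appropriate invalidations.
    """
    return "gpu_llc" if req.glc else "directory"

print(completion_level(AtomicRequest(0x1000, glc=False)))  # directory
print(completion_level(AtomicRequest(0x1000, glc=True)))   # gpu_llc
```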
On Tue, Jul 4, 2023 at 6:01 AM Anoop Mysore wrote:
> Thank you so much for the kind and detailed explanations!
>
> Just to clarify: I can use the APU config (apu_se.py) and switch out to an
> O3 CPU, and I would still have the detailed GPU model, and the disconnected
> Ruby model that synchronizes between CPU and GPU at the system-level
> directory -- is that correct?
>
> Last question: when using the APU config for simulating HeteroSync which,
> for example, has a sleep mutex primitive that invokes a
> __builtin_amdgcn_s_sleep(), is there any OS involvement? If yes, would SE
> mode's emulation of those syscalls inexorably sacrifice any fidelity that
> could be argued leads to inaccurate evaluations of heterogeneous coherence
> implementations? Or are there any other factors of insufficient fidelity
> that might be important in this regard?
>
>
> On Fri, Jun 30, 2023 at 7:40 PM Matt Sinclair <
> mattdsinclair.w...@gmail.com> wrote:
>
>> Just to follow-up on 4 and 5:
>>
>> 4. The synchronization should happen at the directory-level here, since
>> this is the first level of the memory system where both the CPU and GPU are
>> connected. However, if the programmer sets the GLC bit (which should
>> perform the atomic at the GPU's LLC), I have not tested whether Ruby has
>> the functionality to send the appropriate invalidations to allow this. I
>> suspect it would work as is, but I would have to check ...
>>
>> 5. Yeah, for the reasons Matt P already stated, O3 is not currently
>> supported in GPUFS. So GPUSE would be a better option here. Yes, you can
>> use the apu_se.py script as the base script for running GPUSE experiments.
>> There are a number of examples on gem5-resources for how to get started
>> with this (including HeteroSync), but I normally recommend starting with
>> square if you haven't used the GPU model before:
>> https://gem5.googlesource.com/public/gem5-resources/+/refs/heads/develop/src/gpu/square/.
>> In terms of support for synchronization at different levels of the memory
>> hierarchy, by default the GPU VIPER coherence protocol assumes that all
>> synchronization happens at the system-level (at the directory, in the
>> current implementation). However, one of my students will be pushing
>> updates (hopefully today) that allow non-system level support (e.g., the
>> GPU LLC "GLC" level as mentioned above). It sounds like you want to change
>> the cache hierarchy and coherence protocol to add another level of cache
>> (the L3) before the directory and after the CPU/GPU LLCs? If so, you would
>> need to change the current Ruby support to add this additional level and
>> the appropriate transitions to do so. However, if you instead meant that
>> you are thinking of the directory level as synchronizing between the CPU
>> and GPU, then you could use the support as is without any changes (I think).
>>
>> Hope this helps,
>> Matt S.
>>
>> On Fri, Jun 30, 2023 at 12:05 PM Poremba, Matthew via gem5-users <
>> gem5-users@gem5.org> wrote:
>>
>>> Hi,
>>>
>>> No worries about the questions! I will try to answer them all, so this
>>> will be a long email 😊:
>>>
>>> The disconnected (or disjoint) Ruby network is essentially the same as
>>> the APU Ruby network used in SE mode - That is, it combines two Ruby
>>> protocols in one protocol (MOESI_AMD_base and GPU_VIPER). They are
>>> disjoint because there are no network paths/links between the GPU and
>>> CPU sides, simulating a discrete GPU. These protocols work together
>>> because they use the same network messages / virtual channels to the
>>> directory – basically, you cannot simply drop in another CPU protocol
>>> and have it work.
>>>
>>> The Atomic CPU only started working **very** recently – as in, this
>>> week. It is on
>>> review board right now and I believe might be part of the gem5 v23.0
>>> release. However, the reason Atomic and KVM CPUs are requi