Re: [gem5-users] Linux's memcpy can't saturate DRAM bandwidth with O3 core?

Nils Asmussen Wed, 18 May 2016 02:31:43 -0700

Oh, I'm sorry, I've just seen that the default config for x86 really
does not have a prefetcher set. I had somehow assumed that it did :/


I've now set a prefetcher for the L2 cache as on ARM:
prefetcher = StridePrefetcher(degree=8, latency = 1)

With that, I achieve a bandwidth of 1702 MiB/s. So that's at least a
lot better than before, although it's still well short of the 4352 MiB/s
I get with my device.

Shouldn't it be much better than that?
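
As a rough sanity check on these numbers, here is a minimal sketch that
applies Little's law (requests in flight = request rate * latency) to the
bandwidths reported in this thread. The 100 ns miss latency is a
hypothetical placeholder, not a value taken from my stats.txt, and the
function and variable names are made up for illustration. The result is
expressed in 64-byte cache lines, also for the device, which actually
fetches 1 KiB at once.

```python
CACHE_LINE_BYTES = 64      # cache-line size used by the memcpy path
MIB = 1024 * 1024


def lines_in_flight(bandwidth_mib_s, latency_ns, line_bytes=CACHE_LINE_BYTES):
    """Little's law: requests in flight = request rate * latency."""
    requests_per_s = bandwidth_mib_s * MIB / line_bytes
    return requests_per_s * latency_ns * 1e-9


# ASSUMPTION: hypothetical round-trip miss latency, not a measured value.
ASSUMED_LATENCY_NS = 100.0

# Bandwidths in MiB/s as reported in this thread.
measured = {
    "default x86 config": 1079,
    "with L2 stride prefetcher": 1702,
    "device": 4352,
}

for name, bw in measured.items():
    share = bw / measured["device"]
    in_flight = lines_in_flight(bw, ASSUMED_LATENCY_NS)
    print(f"{name}: {bw} MiB/s, {share:.0%} of the device bandwidth, "
          f"about {in_flight:.1f} cache lines in flight")
```

If the real latency is in that range, reaching the device's bandwidth
would require several times as many cache-line requests in flight as the
core currently sustains. The actual latency would have to be read from
the DRAM controller stats.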

Best regards,
Nils



On 05/18/2016 10:38 AM, Nils Asmussen wrote:
> Hi,
> 
> does anybody know why Linux cannot saturate the DRAM bandwidth with an
> O3 core? Or know how I can track the problem down?
> 
> Best regards,
> Nils
> 
> 
> On 05/11/2016 10:20 AM, Nils Asmussen wrote:
>> Hi again,
>>
>> I've now played around a bit. First, I noticed that I had better do the
>> resetstats and dumpstats in my benchmark program itself, directly before
>> and after the file reading, via pseudo instructions instead of using the
>> m5 program as I did before. This decreases the difference a bit, but the
>> effect is still there.
>>
>> Using the default x86 config as last time, I now achieve 1079 MiB/s on
>> Linux. With my device, I achieve 4352 MiB/s.
>>
>> Then I have copied the parameters from O3_ARM_v7a_3 (except fuPool,
>> because I don't know whether that's a good idea) to a new subclass of
>> DerivO3CPU. With that, I achieve a bandwidth of 608 MiB/s.
>>
>> Finally, I set LQEntries and SQEntries to 128 (otherwise, it's the
>> default DerivO3CPU), hoping to increase the number of prefetched cache
>> lines. But that even decreases the bandwidth slightly, to 1034 MiB/s.
>>
>> Is there something else I need to do to improve the prefetching?
>>
>> I have also uploaded the stats.txt files from Linux on the default x86
>> config and the one from the system with my device, if you want to take a
>> look:
>> Linux: https://gist.github.com/Nils-TUD/18c614553463fbd2fa6df74fd31440b4
>> Dev: https://gist.github.com/Nils-TUD/058fb8e8de4981b5b04d4389c8aef41e
>>
>> In the latter case, the DRAM controller sits in pe8, so you can find the
>> stats at the very bottom.
>>
>> Best regards,
>> Nils
>>
>>
>>
>> On 05/10/2016 03:56 PM, Nils Asmussen wrote:
>>> Hi Andreas,
>>>
>>> thanks for the quick response.
>>>
>>> Doing the experiment on ARM would be a bit of effort. Can't I simply
>>> tune the parameters of the O3 CPU like ARM does, i.e., copy them from
>>> configs/O3_ARM_v7a.py?
>>>
>>> What do you mean by "add prefetches to the cache configs"?
>>>
>>> Best regards,
>>> Nils
>>>
>>>
>>> On 05/10/2016 03:39 PM, Andreas Hansson wrote:
>>>> Hi Nils,
>>>>
>>>> I suspect this is all down to prefetching, or lack thereof. I would
>>>> suggest trying your experiment with build/ARM/gem5.opt and the
>>>> arm_detailed CPU (or alternatively add prefetches to the cache configs you
>>>> are using at the moment).
>>>>
>>>> Andreas
>>>>
>>>> On 10/05/2016, 14:36, "gem5-users on behalf of Nils Asmussen"
>>>> <[email protected] on behalf of [email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I'm running Linux 3.18 on a single-core x86 system using the O3 model.
>>>>> The command line is:
>>>>> ./build/X86/gem5.opt configs/example/fs.py --cpu-type detailed
>>>>> --cpu-clock=1GHz --sys-clock=1GHz --caches --l2cache
>>>>> --command-line="ttyS0 noapictimer console=ttyS0 lpj=7999923
>>>>> root=/dev/sda1"
>>>>>
>>>>> On Linux, I'm executing a self-written benchmark, which reads a 2 MiB
>>>>> file using read system calls. That means that, in the end, the kernel
>>>>> does a memcpy to copy that file into the user buffer.
>>>>> Looking at stats.txt (only measuring the benchmark itself), I see 274
>>>>> MiB/s at the DRAM controller.
>>>>>
>>>>> In my project, I developed a device that can be used, e.g., to
>>>>> load data from the DRAM. I run a similar benchmark on my system that
>>>>> reads a 2 MiB file from the DRAM using that device. In this case, I'm
>>>>> seeing 3 GiB/s at the DRAM controller.
>>>>>
>>>>> The main difference is that my device fetches 1 KiB at once from the
>>>>> DRAM, while the memcpy loads it cacheline by cacheline, i.e. 64 bytes at
>>>>> once.
>>>>>
>>>>> Is that expected behaviour or am I doing something wrong?
>>>>>
>>>>> Let me know if you need more information.
>>>>>
>>>>> Best regards,
>>>>> Nils
>>>>>
>>>>
>>>> _______________________________________________
>>>> gem5-users mailing list
>>>> [email protected]
>>>> http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
>>>>
