Re: [gem5-users] In-order faster than O3?
For anyone who might be interested, this issue has been resolved. There were two problems.

First, I was assuming O3 used a write buffer to allow stores to retire early (implying non-SC memory consistency). I model one in my in-order processor, but O3 appears to be sequentially consistent and hence does not have one.

Second, when I increased the MemWrite and MemRead latencies, I did not also increase the wbDepth setting beyond its default of 1, which created a writeback bottleneck when multiple longer-latency operations were executing simultaneously.
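For reference, a minimal sketch of the second fix, assuming the 2011-era DerivO3CPU parameter names (wbDepth in O3CPU.py, and the OpDesc opLat fields in src/cpu/o3/FuncUnitConfig.py); the exact values are illustrative only:

```python
# Hedged sketch: when the MemRead/MemWrite opLat in
# src/cpu/o3/FuncUnitConfig.py is raised above 1, the writeback buffer
# (wbWidth * wbDepth slots, each held until its instruction completes)
# must grow too, or long-latency memory ops exhaust it and stall issue.
from m5.objects import DerivO3CPU

cpu = DerivO3CPU()
mem_op_latency = 4             # illustrative MemRead/MemWrite latency (cycles)
cpu.wbDepth = mem_op_latency   # default is 1; scale with the longest opLat
```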
[gem5-users] In-order faster than O3?
Hi all, I recently extended the atomic CPU model to simulate a deeply pipelined, two-issue, in-order machine. The code includes variable-length instruction latencies, checks for register dependences, has full bypass/forwarding capability, and so on. I have reason to believe it is working as it should. Curiously, when I run binaries using this CPU model, it frequently outperforms the O3 CPU model in terms of cycle count.

The O3 model I compare against is also two-issue and has an 8-entry load queue, an 8-entry store queue, a 16-entry IQ, a 32-entry ROB, and extra physical registers, but is otherwise configured identically. The in-order core models identical branch prediction with a rather generous 13-cycle mispredict penalty for the two-issue core (e.g., as in the ARM Cortex-A8), and still achieves better performance in most cases.

I'm finding it hard to parse through all the O3 trace logs, so I was wondering if anyone has intuition as to why this might be the case. Does the O3 CPU not do full bypassing? Is there speculation going on beyond just branch prediction? I plan to look into the source code in more detail, but I was wondering if someone could give me a leg up by pointing me in the right direction.

I've also noticed that when I set the MemRead and MemWrite latencies in src/cpu/o3/FuncUnitConfig.py to anything greater than 1, O3 performance slows down quite drastically (~10% per increment). This doesn't really make sense to me either. I'm not configuring a massive instruction window, but I wouldn't expect performance to suffer quite so much. If it helps, all my simulations so far are just using ARM.
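For concreteness, a sketch of the O3 setup being described, using the stock DerivO3CPU parameter names of this era of gem5 (LQEntries, SQEntries, numIQEntries, numROBEntries, and the width parameters); the values simply restate the numbers above, and the FuncUnitConfig.py fragment shows where the MemRead/MemWrite latencies mentioned above live:

```python
# Sketch of the O3 configuration described above (values from the post;
# parameter spellings assume the 2011-era DerivO3CPU config interface).
from m5.objects import DerivO3CPU

cpu = DerivO3CPU()
cpu.fetchWidth = 2
cpu.decodeWidth = 2
cpu.issueWidth = 2      # two-issue, to match the in-order core
cpu.commitWidth = 2
cpu.LQEntries = 8       # 8-entry load queue
cpu.SQEntries = 8       # 8-entry store queue
cpu.numIQEntries = 16   # 16-entry issue queue
cpu.numROBEntries = 32  # 32-entry reorder buffer

# The MemRead/MemWrite latencies live in src/cpu/o3/FuncUnitConfig.py as
# OpDesc opLat fields, e.g. (illustrative):
#     class RdWrPort(FUDesc):
#         opList = [ OpDesc(opClass='MemRead',  opLat=2),
#                    OpDesc(opClass='MemWrite', opLat=2) ]
```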
Re: [gem5-users] In-order faster than O3?
Hi Marc,

If you haven't updated your code recently: I committed some changes last week that fixed some dependency issues with the ARM condition codes in the o3 cpu model. Previously, any instruction that wrote a condition code had to do a read-modify-write operation on all the condition codes together, meaning that a string of instructions that set condition codes were all dependent on each other (e.g., two back-to-back flag-setting adds with independent data operands would still serialize on the condition-code register). The committed code fixes this issue and sees improvements of up to 22% on some SPEC benchmarks.

If that doesn't fix the issue, you'll need to see where the o3 model is stalling on your workload. Some of the statistics might help narrow it down a bit. The model should be able to issue dependent instructions in back-to-back cycles, and it executes instructions speculatively (including loads).

Any chance you'd share your cpu model? Are you sure you're accounting for memory latency correctly in it? The atomic memory mode completes a load/store instantly, so if you're not correctly accounting for the real time it would take for that load/store to complete, that could be part of the issue.

Ali
Re: [gem5-users] In-order faster than O3?
I'd also take a look at how many MSHRs you are giving your caches and see whether it matches your cpu model. For example, if you only have 2 MSHRs but your model is issuing up to 8 speculative loads, there's a chance your system may be under-provisioned and losing some performance.

--
Korey
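As a concrete version of this check, a minimal sketch assuming the standard BaseCache parameters (mshrs, tgts_per_mshr) in gem5's Python configs; the 8 matches the load-queue depth discussed in this thread, and the cache geometry is illustrative:

```python
# Give the L1 data cache at least as many MSHRs as the CPU can have
# loads outstanding, so misses don't stall for lack of tracking entries.
from m5.objects import BaseCache

class L1DCache(BaseCache):
    size = '32kB'        # illustrative geometry
    assoc = 2
    mshrs = 8            # >= max in-flight (speculative) loads, e.g. LQ depth
    tgts_per_mshr = 4    # secondary misses that can merge into each MSHR
```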
Re: [gem5-users] In-order faster than O3?
Thanks Ali and Korey. My checkout is about a month old, so that could be the issue. I'll take a look tomorrow. MSHR settings are okay.

I'm using the atomic-reported memory latencies (dcache_latency and icache_latency) to compute access latencies in my model. I assume these are as accurate as the timing or O3 CPU latencies for single-threaded workloads.

If I keep having issues, I'm happy to share the CPU model, but it's a research prototype configured to do exotic researchy sorts of things, so I'm not sure how helpful that would be. =)
Re: [gem5-users] In-order faster than O3?
> exotic researchy sorts of things

In my early morning delirious state, that sir is a +1.

--
Korey
Re: [gem5-users] In-order faster than O3?
Hi Marc,

The atomic latency isn't as accurate as the latency with the memory system in timing mode. What is returned is an unloaded latency (one request in the entire memory system, with no contention at all).

Ali
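To make the distinction concrete, a hedged sketch of the two memory modes using standard gem5 config settings (the CPU pairings noted in the comments are the usual ones):

```python
from m5.objects import System

system = System()

# Atomic mode: each access completes in one call and returns an idealized,
# unloaded latency -- one request in the system, no contention anywhere.
system.mem_mode = 'atomic'    # typically paired with AtomicSimpleCPU

# Timing mode: requests travel through ports, queues, and caches, so the
# observed latency includes contention -- what a timing study should use.
system.mem_mode = 'timing'    # typically paired with TimingSimpleCPU or O3
```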