Hi Everybody,
We are working on atomic instructions in the x86 ISA and how they are
handled by gem5.
While running a simple benchmark that executes an atomic increment
in a loop for 1M iterations, we encountered 2 million memory order
violations. Each one squashes a load instruction because of a missed
memory dependence on a previous store.
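For reference, a benchmark of this kind can be sketched in C++ as below. This is only an illustration: the function name runAtomicIncrements is made up for this sketch, and we assume that std::atomic<int>::fetch_add is compiled to a LOCK-prefixed read-modify-write on x86, which is what compilers typically do.

```cpp
#include <atomic>

// Hypothetical helper (name chosen for this sketch only).
// Performs `iterations` atomic increments on the shared counter `x`.
void runAtomicIncrements(std::atomic<int> &x, int iterations)
{
    for (int i = 0; i < iterations; ++i) {
        // Sequentially consistent read-modify-write. Assumption: on x86
        // this becomes a LOCK-prefixed instruction.
        x.fetch_add(1);
    }
}
```

Calling it with a zero-initialized counter and 1000000 iterations gives the 1M-iteration loop described above.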
After delving into the code in "src/cpu/o3/mem_dep_unit_impl.hh",
we made the following observations, which explain the problem
described above:
1) The function insert(DynInstPtr &inst) is responsible for inserting
the new instruction into the instruction queue.
2) If the incoming instruction is a load, insert() checks whether the
load depends on an in-flight memory barrier or on a preceding
store.
3) If so, it adds the load to the list of dependents of that memory
barrier or store. (THE PROBLEM IS HERE)
THE PROBLEM: when gem5 looks for the producing store of a load, it
gives priority to memory barriers. Only if no memory barrier is in
flight does it consult the store-set memory dependence predictor to
find the latest preceding store associated with the load.
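To make this order concrete, here is a minimal toy model of the selection as we understand it. It is not the gem5 code: the types ToyMemDepState and SeqNum and the function pickProducer are hypothetical names introduced only for this sketch.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Toy model only; these names are hypothetical, not gem5's.
using SeqNum = std::uint64_t;

struct ToyMemDepState {
    // Sequence number of an in-flight memory barrier, if any.
    std::optional<SeqNum> pendingBarrier;
    // Store-set prediction: load PC -> sequence number of the
    // latest preceding store associated with that load.
    std::unordered_map<std::uint64_t, SeqNum> storeSet;
};

// Returns the single instruction the load is made dependent on.
// A pending barrier wins; the store-set prediction is consulted
// only when no barrier is in flight.
std::optional<SeqNum> pickProducer(const ToyMemDepState &s,
                                   std::uint64_t loadPc)
{
    if (s.pendingBarrier)
        return s.pendingBarrier;
    auto it = s.storeSet.find(loadPc);
    if (it != s.storeSet.end())
        return it->second;
    return std::nullopt;
}
```

In this model, whenever a barrier is pending, the predicted store is never returned, so the load is not made to wait for it.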
This lookup order leads to a large number of memory order violations
in the following example:
Example benchmark:
for (int i = 0; i < 1000000; i++)
{
    /* This assembly represents a simple atomic increment in a loop */
    -- Mem_Fence
    -- Store x
    -- Load x
}
In the example above, Load x should in theory be dependent on
Store x. In the gem5 implementation, however, the Mem_Fence is
selected as the instruction the load depends on, and the dependence
between the load and the store is dropped.
Making the load dependent only on the Mem_Fence in this code, which
is our simple benchmark, causes 2 million squashes (i.e., memory
order violations), which degrades performance significantly.
THE SOLUTION:
- Adding the dependence between the store and the load solves the
problem: the number of squashes (i.e., memory order violations)
drops from 2 million to only 900, and execution time is reduced.
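The idea can be sketched with a small toy model in which a load may depend on both the barrier and the store. This is not our actual patch and not gem5 code: ToyDepTracker, SeqNum and collectProducers are hypothetical names used only for illustration.

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Toy model only; these names are hypothetical, not gem5's.
using SeqNum = std::uint64_t;

struct ToyDepTracker {
    // Sequence number of an in-flight memory barrier, if any.
    std::optional<SeqNum> pendingBarrier;
    // Store-set prediction: load PC -> latest associated store.
    std::unordered_map<std::uint64_t, SeqNum> storeSet;
};

// Returns every instruction the load should wait for: the pending
// barrier (if any) and, in addition, the predicted store (if any).
std::vector<SeqNum> collectProducers(const ToyDepTracker &t,
                                     std::uint64_t loadPc)
{
    std::vector<SeqNum> producers;
    if (t.pendingBarrier)
        producers.push_back(*t.pendingBarrier);
    auto it = t.storeSet.find(loadPc);
    if (it != t.storeSet.end())
        producers.push_back(it->second);
    return producers;
}
```

With both a pending barrier and a predicted store, the load in this model waits for both, instead of for the barrier alone.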
We would like to ask whether our observation about how dependencies
are added for loads in "src/cpu/o3/mem_dep_unit_impl.hh" is
correct.
In other words, can the dependence on the store be added as well
(in addition to the one on the fence) in gem5 with minor
modifications?
Thank you for reading our email; we appreciate your consideration.
Sincerely,
Ashkan Asgharzadeh,
Ph.D. Student at the CS Faculty,
University of Murcia, Spain
_______________________________________________
gem5-dev mailing list
gem5-dev@gem5.org
http://m5sim.org/mailman/listinfo/gem5-dev