Hi Everybody,

We are studying atomic instructions in the x86 ISA and how gem5 handles them.

While running a simple benchmark that executes an atomic increment in a loop for 1M iterations, we encountered 2 million memory order violations. Each violation squashes a load instruction because of a missed memory dependence on a previous store.

After studying the code in "src/cpu/o3/mem_dep_unit_impl.hh", we made the following observations, which explain the problem described above:

1) The function insert(DynInstPtr &inst) is responsible for inserting a new instruction into the instruction queue.

2) If the incoming instruction is a load, insert() checks whether the load depends on an in-flight memory barrier or on a preceding store.

3) If it does, insert() adds the load to the dependents list of that memory barrier or store. This is where the problem arises.

THE PROBLEM: when gem5 looks for the producing store of a load, it gives priority to memory barriers. Only if no memory barrier is in flight does it consult the store-set memory dependence predictor to find the latest preceding store associated with the load.
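The lookup order described above can be sketched as a small toy model. This is not gem5 code: ToyMemDepState, producersBarrierFirst, and the sequence numbers are hypothetical names chosen for illustration, and the real insert() tracks considerably more state.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Toy model only; none of these names exist in gem5.
using SeqNum = std::uint64_t;

struct ToyMemDepState {
    std::set<SeqNum> inflightBarriers;  // barriers that have not completed yet
    bool hasPredictedStore = false;     // did the store-set predictor name a store?
    SeqNum predictedStore = 0;          // sequence number of that store, if any
};

// Returns the instructions a new load would wait on when barriers
// take priority and the predictor is consulted only as a fallback.
std::vector<SeqNum> producersBarrierFirst(const ToyMemDepState &s) {
    if (!s.inflightBarriers.empty()) {
        // A barrier is in flight: the predicted store is never looked at.
        return std::vector<SeqNum>(s.inflightBarriers.begin(),
                                   s.inflightBarriers.end());
    }
    if (s.hasPredictedStore) {
        return {s.predictedStore};
    }
    return {};
}
```

With a barrier at sequence number 10 and a predicted store at 11, this model returns only the barrier, which is the situation we believe leads to the violations.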

This lookup order causes a large number of memory order violations in the following example:

Example benchmark:

for (int i = 0; i < 1M; i++)
{
    /* This assembly represents a simple atomic increment in a loop */

    -- Mem_Fence
    -- Store x
    -- Load x
}
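A C++ loop of roughly the following shape is one way to obtain such an instruction stream. It is only an assumed stand-in, not the exact benchmark source: runAtomicIncrements is a hypothetical name, and the micro-ops gem5 actually sees depend on how the compiler lowers fetch_add and on the gem5 version.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical stand-in for the benchmark loop.
std::uint64_t runAtomicIncrements(std::uint64_t iterations) {
    std::atomic<std::uint64_t> x{0};
    for (std::uint64_t i = 0; i < iterations; ++i) {
        // On x86, compilers typically lower this to a LOCK-prefixed
        // read-modify-write instruction.
        x.fetch_add(1, std::memory_order_seq_cst);
    }
    return x.load();
}
```

Calling runAtomicIncrements(1000000) corresponds to the 1M iterations mentioned above.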

 
In the snippet above, the load of x should in theory depend on the store to x. In the gem5 implementation, however, the Mem_Fence is selected as the instruction the load depends on, and the dependence between the load and the store is not recorded.

With the load depending only on the Mem_Fence, our simple benchmark incurs 2 million squashes (i.e., memory order violations), which degrades performance significantly.

THE SOLUTION:
- Adding the dependence between the store and the load solves the problem: the number of squashes (i.e., memory order violations) drops from 2 million to only 900, and execution time decreases.
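The idea behind this change can be sketched with a toy model in which the load waits on the barrier and on the predicted store. This is not the actual patch: ToyLoadContext and producersBarrierAndStore are hypothetical names, and a real change would have to fit the data structures in the memory dependence unit.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Toy model only; none of these names exist in gem5.
using SeqNum = std::uint64_t;

struct ToyLoadContext {
    std::set<SeqNum> inflightBarriers;  // barriers that have not completed yet
    bool hasPredictedStore = false;     // did the store-set predictor name a store?
    SeqNum predictedStore = 0;          // sequence number of that store, if any
};

// Returns every instruction the load waits on when the predicted store
// is recorded in addition to any in-flight barriers.
std::vector<SeqNum> producersBarrierAndStore(const ToyLoadContext &ctx) {
    std::vector<SeqNum> producers(ctx.inflightBarriers.begin(),
                                  ctx.inflightBarriers.end());
    // Add the store as well, unless it is already listed as a barrier.
    if (ctx.hasPredictedStore &&
        ctx.inflightBarriers.count(ctx.predictedStore) == 0) {
        producers.push_back(ctx.predictedStore);
    }
    return producers;
}
```

With a barrier at sequence number 10 and a predicted store at 11, the load now waits on both, so it cannot issue ahead of the store.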

We would like to ask whether our observation about how dependencies are added for load instructions in "src/cpu/o3/mem_dep_unit_impl.hh" is correct.

In other words, can the dependence on the store be added alongside the dependence on the fence in gem5 with only minor modifications?

Thank you very much for reading our email; we greatly appreciate your consideration.

Sincerely,
Ashkan Asgharzadeh,
Ph.D. Student at the CS Faculty,
University of Murcia, Spain
_______________________________________________
gem5-dev mailing list
gem5-dev@gem5.org
http://m5sim.org/mailman/listinfo/gem5-dev
