Hi everybody,

While running some benchmarks (specifically, a modified version of
ocean-contiguous-partitions from Splash-3,
https://github.com/SakalisC/Splash-3), we encountered a deadlock.

After digging into the trace files, we found that an atomic
instruction had locked the cache block it requested. This lock has to
be released by the 'stul' micro-op, but the stul's memory request was
blocked at the LSQ unit because a load was waiting for a cache
response. That load will never finish, because it refers to the same
cache block as the locked one.

X86 atomics are defined to be surrounded by two memory barriers:

mfence
ldstl
...
stul
mfence

so a later memory instruction has to wait until the mfence finishes.
The memory dependence unit has a special handler for fences. When a
fence is added, the unit records that a fence is active, along with
its sequence number. From then on, it adds the current fence as a
memory dependency to every instruction inserted until that fence
commits:

+---+------------+-----+
|seq|Instructions|Fence|
+---+------------+-----+
|  0| add        |     |
|  1| mfence     |   1 |
|  2| ldstl      |   1 |
|  3| add        |   1 |
|  4| stul       |   1 |
|  5| mfence     |   5 |
|  6| load       |   5 |
|  7| mfence     |   7 |
+---+------------+-----+
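
As a rough illustration of the bookkeeping described above, here is a
minimal Python sketch. It is a toy model, not gem5 code: the class and
method names are made up for this example, and it ignores commit
entirely (it only shows how each instruction gets tagged with the
current fence).

```python
class FenceTracker:
    """Toy model of the single 'current fence' bookkeeping (hypothetical names)."""

    def __init__(self):
        # Sequence number of the active fence, or None if no fence is active.
        self.current_fence = None

    def insert(self, seq, opcode):
        """Register an instruction and return the fence recorded for it."""
        if opcode == "mfence":
            # A new fence replaces whatever fence was current before.
            self.current_fence = seq
        return self.current_fence


tracker = FenceTracker()
program = ["add", "mfence", "ldstl", "add", "stul", "mfence", "load", "mfence"]
tags = [tracker.insert(seq, op) for seq, op in enumerate(program)]
print(tags)  # [None, 1, 1, 1, 1, 5, 5, 7]
```

The printed list matches the "Fence" column of the table, with None
standing for an empty cell.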

With this scheme, everything should work, but what happens when a
later mfence is squashed?

Looking at the memory dependence unit, we see that when a fence is
removed or squashed, it checks whether it is the current fence, and if
it is, fences are disabled in the memory dependence unit. So what
happens if a fence is squashed while a previous fence has not
committed yet? The following table shows a possible case:

+---+------------+-----+---------+
|seq|Instructions|Fence|Committed|
+---+------------+-----+---------+
|  0| add        |     |     Yes |
|  1| mfence     |   1 |     Yes |
|  2| ldstl      |   1 |     Yes |
|  3| add        |   1 |     Yes |
|  4| stul       |   1 |      No |
|  5| mfence     |   5 |      No |
|  6| load       |   5 |      No |
|  7| beq        |   5 |      No |---+
|  8| mfence     |   8 |      No |   |
|  9| ldstl      |   5 |      No |   | Squashed
| 10| sub        |   5 |      No |   |
| 11| stul       |   5 |      No |   |
| 12| mfence     |  12 |      No |<--+
| 13| load       |     |      No |
+---+------------+-----+---------+

The branch instruction is mispredicted, so the instructions at seq:8
to seq:12 are squashed. However, those instructions had set new
fences, and squashing the fence at seq:12 (the current one) disables
the fence in the memory dependence unit. As a result, the original
fence at seq:5 is no longer active even though it has not committed.
The load at seq:13 can now execute, and if it collides with the
unfinished 'stul' instruction, it can cause a memory dependency
violation and, later, a deadlock.
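
The scenario in the table can be reproduced with a small Python toy
model. This is only a sketch of the behavior as we describe it, not
the actual gem5 implementation; the class and method names are
invented for the example.

```python
class SingleFenceTracker:
    """Toy model that remembers only one 'current' fence (hypothetical names)."""

    def __init__(self):
        self.current_fence = None

    def insert(self, seq, opcode):
        """Register an instruction; return the fence it depends on, if any."""
        if opcode == "mfence":
            self.current_fence = seq
        return self.current_fence

    def remove_fence(self, seq):
        """Called when a fence commits or is squashed."""
        if self.current_fence == seq:
            # No record of older fences survives, so tracking is disabled.
            self.current_fence = None


program = ["add", "mfence", "ldstl", "add", "stul", "mfence", "load",
           "beq", "mfence", "ldstl", "sub", "stul", "mfence"]

buggy = SingleFenceTracker()
for seq, op in enumerate(program):
    buggy.insert(seq, op)

# The fence at seq 1 commits; it is not the current fence, so nothing changes.
buggy.remove_fence(1)

# The beq at seq 7 is mispredicted: squash the younger fences, youngest first.
for seq in (12, 8):
    buggy.remove_fence(seq)

load_dependency = buggy.insert(13, "load")
print(load_dependency)  # None, although the fence at seq 5 has not committed
```

In this model the load at seq 13 ends up with no fence dependency,
which is the situation shown in the last row of the table.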

It should be like this:

+---+------------+-----+---------+
|seq|Instructions|Fence|Committed|
+---+------------+-----+---------+
|  0| add        |     |     Yes |
|  1| mfence     |   1 |     Yes |
|  2| ldstl      |   1 |     Yes |
|  3| add        |   1 |     Yes |
|  4| stul       |   1 |      No |
|  5| mfence     |   5 |      No |<-----------------+
|  6| load       |   5 |      No |                  |
|  7| beq        |   5 |      No |---+              |
|  8| mfence     |   8 |      No |   |              |
|  9| ldstl      |   5 |      No |   | Squashed     | Dependency
| 10| sub        |   5 |      No |   |              | Recovered
| 11| stul       |   5 |      No |   |              |
| 12| mfence     |  12 |      No |<--+              |
| 13| load       |   5 |      No |------------------+
+---+------------+-----+---------+

To solve this problem, we have several ideas:
- Store all the active fences in a "stack-like" structure, and when a
  fence is removed/squashed, restore the last active fence. (This is
  the solution we implemented; please find the patch attached.)
- Give the branch the information about the last active fence, and
  when it is squashed, restore that fence.
- Add to the new fence a dependency on the current active fence, and
  restore that fence when the new one is squashed.
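
To make the first idea more concrete, here is a minimal Python sketch
of a "stack-like" tracker. It is only an illustration under our own
assumptions and is not the attached patch: the names are hypothetical,
and a plain list ordered by sequence number stands in for whatever
structure a real implementation would use.

```python
class StackFenceTracker:
    """Toy model that keeps every active fence (hypothetical names)."""

    def __init__(self):
        # Active fences in program order; the last entry is the current fence.
        self.active_fences = []

    def insert(self, seq, opcode):
        """Register an instruction; return the fence it depends on, if any."""
        if opcode == "mfence":
            self.active_fences.append(seq)
        return self.active_fences[-1] if self.active_fences else None

    def remove_fence(self, seq):
        """Called when a fence commits (oldest) or is squashed (youngest)."""
        if seq in self.active_fences:
            self.active_fences.remove(seq)


program = ["add", "mfence", "ldstl", "add", "stul", "mfence", "load",
           "beq", "mfence", "ldstl", "sub", "stul", "mfence"]

fixed = StackFenceTracker()
for seq, op in enumerate(program):
    fixed.insert(seq, op)

fixed.remove_fence(1)        # the fence at seq 1 commits
for seq in (12, 8):          # misprediction at seq 7 squashes younger fences
    fixed.remove_fence(seq)

recovered_dependency = fixed.insert(13, "load")
print(recovered_dependency)  # 5
```

After the squash, the fence at seq 5 becomes the current fence again,
so the load at seq 13 depends on it, as in the "Dependency Recovered"
table.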

We want to know your thoughts about this problem and how to solve it.

Environment used:
- Arch: X86
- Simulation: Full-System
- Number of Cores: 16
- Cache: Ruby (L1 and L2)
- Coherence Protocol: MESI_TWO_Levels
- Kernel: Linux 4.9.3
- OS: Ubuntu 16.04
- Application: ocean-contiguous (with modifications)

Thanks a lot for your attention.

Best Regards,
Eduardo

--
Eduardo José Gómez Hernández
[email protected]
Faculty of Computer Science
University of Murcia (Spain)
