I'm looking into the effects of false sharing when writing to memory-mapped 
files.  Intuitively, I would think that to avoid false sharing between 
multiple threads writing (no reader threads) to the mapped file at 
different offsets at the same time, writes to the file should be aligned 
and cache-line-padded. 

I wrote a simple JMH benchmark to test this.  In one test, 5 threads 
(-t 5) write single longs to the file with no cache padding, using an 
AtomicLong to track the current index into the file.  The other test still 
writes single longs, but adds 64 to the index counter on every iteration 
so that no two writes land on the same cache line.  The results of the 
test across multiple forks were the opposite of what I had expected, with 
the unpadded implementation performing non-trivially better:

Benchmark                    (blackhole)  Mode  Cnt    Score    Error  Units
AlignmentTest.testPadded               0  avgt   25  455.139 ± 59.933  ns/op
AlignmentTest.testUnpadded             0  avgt   25  374.866 ± 46.613  ns/op
AlignmentTest.testBlackHole            0  avgt   25  158.849 ± 20.971  ns/op
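To make the two indexing schemes concrete, here is a minimal sketch of just the stride arithmetic (illustrative only, not the attached benchmark; it assumes a 64-byte cache line):

```java
import java.util.concurrent.atomic.AtomicLong;

public class StrideSketch {
    static final int CACHE_LINE = 64; // assumed cache line size in bytes

    public static void main(String[] args) {
        AtomicLong idx = new AtomicLong();

        // Unpadded: claim consecutive 8-byte slots, so 8 successive writes
        // (possibly from different threads) land on the same cache line.
        long a = idx.getAndAdd(8);
        long b = idx.getAndAdd(8);
        System.out.println("same line: " + (a / CACHE_LINE == b / CACHE_LINE)); // same line: true

        idx.set(0);

        // Padded: advance a full cache line per write, so no two writes
        // can ever share a line.
        long c = idx.getAndAdd(CACHE_LINE);
        long d = idx.getAndAdd(CACHE_LINE);
        System.out.println("same line: " + (c / CACHE_LINE == d / CACHE_LINE)); // same line: false
    }
}
```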

What's also interesting is that this trend holds even when running with 
a single thread (-t 1), though there I could reason that the gap is due 
to reduced spatial locality.
Benchmark                    (blackhole)  Mode  Cnt   Score   Error  Units
AlignmentTest.testPadded               0  avgt   25  14.696 ± 0.008  ns/op
AlignmentTest.testUnpadded             0  avgt   25  11.145 ± 0.278  ns/op
AlignmentTest.testBlackHole            0  avgt   25   9.699 ± 0.008  ns/op

(testBlackHole measures the time it takes to increment the index counter.)
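One contributor worth quantifying (my assumption, not something measured above): for the same number of writes, the 64-byte stride touches 8x as many cache lines, and correspondingly more pages, so the single-threaded gap may simply be the cost of the larger footprint. Back-of-envelope sketch, assuming a 64-byte line and 4 KiB pages:

```java
public class FootprintSketch {
    public static void main(String[] args) {
        final long writes = 1_000_000L;
        final long lineSize = 64;   // assumed cache line size in bytes
        final long pageSize = 4096; // assumed page size in bytes

        long unpaddedLines = writes * 8 / lineSize;      // 8-byte stride packs 8 writes per line
        long paddedLines = writes * lineSize / lineSize; // 64-byte stride: one line per write

        System.out.println("unpadded lines touched: " + unpaddedLines); // 125000
        System.out.println("padded lines touched:   " + paddedLines);   // 1000000
        System.out.println("padded pages touched:   " + writes * lineSize / pageSize); // 15625
    }
}
```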

Looking at the perf stats for both tests shows that the padded 
implementation performs better with regard to cache behavior, as 
expected.  In fact, the only metric that is worse for the padded 
implementation is the CPI, which is nearly 2x that of the unpadded 
implementation.  As best I can tell, the assembly for the two 
implementations is the same as well.  Any ideas what else could cause 
padded writes to perform worse than writes that should cause tons of 
false sharing?

This was tested on an Intel E5-2667 (Haswell) with HotSpot JDK 1.8_92. 
Benchmark source attached. 

-- 
You received this message because you are subscribed to the Google Groups 
"mechanical-sympathy" group.
package org.kavanagh.benchmark;

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.agrona.BitUtil;
import org.agrona.concurrent.AtomicBuffer;
import org.agrona.concurrent.UnsafeBuffer;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.TearDown;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 200, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 30, time = 2, timeUnit = TimeUnit.SECONDS)
@Threads(5)
@Fork(5)
public class AlignmentTest {

  private static final long FILE_SIZE = 1073741824L; // 1 GiB

  static {
    System.setProperty(UnsafeBuffer.DISABLE_BOUNDS_CHECKS_PROP_NAME, "true");
  }

  @Param({ "0", "5" })
  public int blackhole;

  private AtomicBuffer buffer;

  private File tmpFile;

  private RandomAccessFile raf;

  private MappedByteBuffer bbuffer;

  private AtomicLong idx;

  @Setup
  public void init() throws IOException {
    tmpFile = new File("/dev/shm/test.tmp");
    tmpFile.deleteOnExit();
    raf = new RandomAccessFile(tmpFile, "rw");
    FileChannel fileChannel = raf.getChannel();
    bbuffer = fileChannel.map(FileChannel.MapMode.READ_WRITE, 0, FILE_SIZE);
    buffer = new UnsafeBuffer(bbuffer);
    idx = new AtomicLong(0);

    System.out.println("Agrona Bounds Check: " + UnsafeBuffer.SHOULD_BOUNDS_CHECK);

  }

  @TearDown
  public void close() throws IOException {
    raf.close();
    tmpFile.delete();
  }

  // Despite the name, this does not sleep: it burns CPU via
  // Blackhole.consumeCPU when the blackhole param is non-zero,
  // to simulate work between writes.
  private void sleep() {
    if (blackhole > 0) {
      Blackhole.consumeCPU(blackhole);
    }
  }

  // Baseline: index bookkeeping only, no write to the mapped buffer.
  @Benchmark
  public int testBlackHole() {
    sleep();
    int offset = (int) idx.getAndAdd(BitUtil.CACHE_LINE_LENGTH);
    if (offset >= FILE_SIZE - BitUtil.CACHE_LINE_LENGTH) {
      idx.set(0);
      offset = 0;
    }
    return offset;
  }

  @Benchmark
  public int testUnpadded() {
    sleep();
    // Rollover is fine for this test
    int offset = (int) idx.getAndAdd(8);
    if (offset >= FILE_SIZE - 8) {
      idx.set(0);
      offset = 0;
    }
    buffer.putLong(offset, 123L);
    return offset;
  }

  @Benchmark
  public int testPadded() {
    sleep();
    // Rollover is fine for this test
    int offset = (int) idx.getAndAdd(BitUtil.CACHE_LINE_LENGTH);
    if (offset >= FILE_SIZE - BitUtil.CACHE_LINE_LENGTH) {
      idx.set(0);
      offset = 0;
    }
    buffer.putLong(offset, 123L);
    return offset;
  }

}
