Read ahead affect Ceph read performance much

Li Wang Mon, 29 Jul 2013 03:25:10 -0700

We ran the Iozone read test on a 32-node HPC server. As for the hardware of each node, the CPU is very powerful, and so is the network, with a bandwidth > 1.5 GB/s. Each node has 64GB of memory. The IO is relatively slow: the throughput measured locally with 'dd' is around 70MB/s. We configured a Ceph cluster with 24 OSDs on 24 nodes, one MDS, and one to four clients, one client per node. The performance is as follows:


Iozone sequential read throughput (MB/s)

Number of clients        1           2           4
Default resize       180.0954    324.4836    591.5851
Resize: 256MB        645.3347   1022.998    1267.631


The complete iozone command for one client is

iozone -t 1 -+m /tmp/iozone.nodelist.50305030 -s 64G -r 4M -i 0 -+n -w -c -e -b /tmp/iozone.nodelist.50305030.output

Only one thread is started on each client node.


For two clients, it is

iozone -t 2 -+m /tmp/iozone.nodelist.50305030 -s 32G -r 4M -i 0 -+n -w -c -e -b /tmp/iozone.nodelist.50305030.output


As the data show, a larger read-ahead window speeds up sequential reads by more than 3x with one or two clients (about 3.6x and 3.2x), and by about 2.1x with four clients.
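
To make the arithmetic behind that comparison explicit, here is a small sketch that recomputes the ratios from the table above. The function names speedup and print_iozone_speedups are illustrative only and are not part of the original test setup; the numbers are copied from the table.

```c
#include <assert.h>
#include <stdio.h>

/* Throughput ratio of the tuned run over the baseline run. */
double speedup(double baseline_mbs, double tuned_mbs)
{
    assert(baseline_mbs > 0.0);
    return tuned_mbs / baseline_mbs;
}

/* Print the speedup for each client count in the iozone table. */
void print_iozone_speedups(void)
{
    static const int clients[] = {1, 2, 4};
    static const double dflt[] = {180.0954, 324.4836, 591.5851};  /* default   */
    static const double big[]  = {645.3347, 1022.998, 1267.631};  /* 256MB run */

    for (int i = 0; i < 3; ++i)
        printf("%d client(s): %.2fx\n", clients[i], speedup(dflt[i], big[i]));
}
```

Calling print_iozone_speedups() from a main() prints one ratio per client count. The ratio shrinks as clients are added, which suggests the larger window matters most when a single client cannot keep the OSDs busy on its own.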

Besides, since the backend of Ceph is not a traditional hard disk, it is beneficial to detect stride reads and prefetch for them. To show this, we tested stride reads with the test program whose core logic is given further below. As far as we know, the generic read-ahead algorithm of the Linux kernel does not detect stride-read patterns, so we use posix_fadvise() to force prefetching manually.

The record size is 4MB. The result is even more surprising:

            Stride read throughput (MB/s)
Number of records prefetched  0      1      4      16      64      128
Throughput                  42.82  100.74 217.41  497.73  854.48  950.18

As the data show, with a read-ahead window of 128 records (128 x 4MB = 512MB),
throughput is about 22x that without read-ahead (950.18 / 42.82 = 22.2)!

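The core logic of the test program is below:

/* stride: distance between reads, in records; block: record size in bytes */
const long long stride = 17;
const long long block = 4LL << 20;   /* 4MB record */
for (;;) {
  /* Ask the kernel to prefetch the next `count` records of the stride. */
  for (i = 0; i < count; ++i) {
    long long start = pos + (i + 1) * stride * block;
    printf("PRE READ %lld %lld\n", start, start + block);
    posix_fadvise(fd, start, block, POSIX_FADV_WILLNEED);
  }
  len = read(fd, buf, block);
  if (len <= 0)
    break;   /* end of file or read error */
  total += len;
  printf("READ %lld %lld\n", pos, (pos + len));
  pos += len;
  /* Skip the remaining stride - 1 records before the next read. */
  lseek(fd, (stride - 1) * block, SEEK_CUR);
  pos += (stride - 1) * block;
}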
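
For anyone who wants to try the stride test, the following is a minimal, self-contained sketch built around the loop above. It is not the original test program: the function name stride_read, its parameters, and the error handling are assumptions made for illustration. It uses pread() in place of read() plus lseek(), drops the printf tracing, and treats a short read as the last record.

```c
#define _XOPEN_SOURCE 600   /* exposes posix_fadvise() and pread() */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Read every `stride`-th record of `path`, hinting the next `count`
 * records with POSIX_FADV_WILLNEED before each read.
 * Returns the number of bytes read, or -1 on error.
 * (Hypothetical helper; name and signature are not from the original post.)
 */
long long stride_read(const char *path, long long block,
                      long long stride, int count)
{
    assert(block > 0 && stride >= 1 && count >= 0);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char *buf = malloc((size_t)block);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long long pos = 0, total = 0;
    for (;;) {
        /* Hint the records that the next `count` iterations will read. */
        for (int i = 0; i < count; ++i) {
            off_t start = (off_t)(pos + (long long)(i + 1) * stride * block);
            /* Advisory only: a failed hint does not affect correctness. */
            (void)posix_fadvise(fd, start, (off_t)block, POSIX_FADV_WILLNEED);
        }

        ssize_t len = pread(fd, buf, (size_t)block, (off_t)pos);
        if (len < 0) {
            total = -1;           /* read error */
            break;
        }
        if (len == 0)
            break;                /* end of file */
        total += len;
        if (len < block)
            break;                /* short read: treat as the last record */
        pos += stride * block;    /* jump to the next record in the stride */
    }

    free(buf);
    close(fd);
    return total;
}
```

With the parameters from this post, a call would look like stride_read(path, 4LL << 20, 17, 128), where path is a large file on the Ceph mount. Setting count to 0 disables the hints and gives the no-prefetch baseline.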

Given the above results and some more, we plan to submit a blueprint to discuss prefetching optimizations for Ceph.


Cheers,
Li Wang




