[Beowulf] big read triggers migration and slow memory IO?

mathog Wed, 08 Jul 2015 14:28:04 -0700

This big Dell (PowerEdge T620/03GCP, 48 CPUs, >500Gb RAM) keeps throwing me curve balls.

On the RAID file system there are a bunch of files having about 17453170224 bytes. (Slightly different numbers of fixed-length records.) At one level these bytes move around very quickly; this takes only 3 seconds:


dd if=KTEMP1 of=/dev/null bs=8192

(5.8Gb/s) which means it must already be in cache. Nothing else is going on on this system. However, when a program that uses this code (where len_file is again 17453170224)


   buffer = malloc(len_file);
   (void) posix_fadvise(fileno(fin), 0, 0, POSIX_FADV_SEQUENTIAL);
   (void) posix_madvise(buffer, len_file, POSIX_MADV_SEQUENTIAL);
   rlen = fread(buffer, 1, len_file, fin);

is run, the fread() takes at least 30 seconds, sometimes longer, to complete. The thing is, "top" shows this (sorry about the wrap):


  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
22501 mathog    20   0 16.3g  13g  520 R 71.3  2.6   0:44.86 binorder
   99 root      RT   0     0    0    0 S 16.6  0.0   0:08.75 migration/24
    3 root      RT   0     0    0    0 S 12.3  0.0   0:24.91 migration/0

What happens is that RES quickly jumps up to about half of VIRT and then the two migration processes start up, at which point it crawls.

The numbers after "migration" vary. dd doesn't run long enough to trigger whatever this migration business is. If my test program is run a couple of times in a row, sometimes it completes the read in about 8 seconds. When that happens the migration processes do not appear.
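The 30 second and 8 second figures above are wall-clock times for the fread() alone. A minimal sketch of how such a timing could be taken follows; the helper name timed_read and its error handling are hypothetical and are not taken from the binorder program, only the four calls in the middle mirror the snippet quoted earlier.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* for fileno, posix_fadvise, posix_madvise */
#endif
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <time.h>

/* Hypothetical helper: read len_file bytes from path into a freshly
   malloc'd buffer and return the elapsed wall-clock seconds spent in
   fread() alone. Returns -1.0 on any failure or short read. */
double timed_read(const char *path, size_t len_file)
{
    FILE *fin = fopen(path, "rb");
    if (fin == NULL)
        return -1.0;

    char *buffer = malloc(len_file);
    if (buffer == NULL) {
        fclose(fin);
        return -1.0;
    }

    /* Same advisory calls as in the original snippet; results ignored. */
    (void) posix_fadvise(fileno(fin), 0, 0, POSIX_FADV_SEQUENTIAL);
    (void) posix_madvise(buffer, len_file, POSIX_MADV_SEQUENTIAL);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t rlen = fread(buffer, 1, len_file, fin);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    free(buffer);
    fclose(fin);
    if (rlen != len_file)
        return -1.0;

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling timed_read("KTEMP1", 17453170224) several times in a row would show whether the run-to-run variation described above is reproducible.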

Through all of this iostat and iotop do not show any IO at all, presumably because it is all going between memory and file cache, with none of the read being straight from the RAID.

Anyway, using 30s as a nice round number, that works out to about 582Mb/s to move this data from one section of memory to another. Which is pretty poor since the stream benchmark shows:


Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5737.4     0.027951     0.027887     0.028254
Scale:           6273.8     0.025557     0.025503     0.025686
Add:             7632.6     0.031513     0.031444     0.031657
Triad:           8948.2     0.026896     0.026821     0.027126

all of which are 10x faster.  Note that the dd time is consistent
with stream's copy benchmark.
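The rates quoted above are simply the file size divided by the elapsed time. A small sketch of that arithmetic follows; the helper name rate_mb_per_s is hypothetical, and 1 MB is taken as 10^6 bytes, which is an assumption about the units used in this post.

```c
/* Hypothetical helper: transfer rate in MB/s, assuming 1 MB = 1e6 bytes. */
double rate_mb_per_s(double bytes, double seconds)
{
    return bytes / seconds / 1e6;
}
```

With the figures from this post, rate_mb_per_s(17453170224.0, 30.0) comes to roughly 582 for the slow fread() case, and rate_mb_per_s(17453170224.0, 3.0) comes to roughly 5818 for dd, close to the 5737.4 that stream reports for Copy.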

Can anybody shed some light on this behavior? In particular, why does the OS feel the need to "migrate" something when one of these huge reads is running? Mostly I want to know how to make it behave, leaving the process/memory attached to one CPU (but not a particular CPU, just wherever it happens to put it) and not shuffle the data through what seems to be a 1/10X speed memory pathway. Also, is there really a 1/10X memory speed pathway on this big box, or is it just that the migration, whatever that is doing, has a lot of overhead?
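For reference, "attached to one CPU, wherever it happens to be" can be expressed on Linux with sched_getcpu() and sched_setaffinity(). The sketch below assumes glibc on Linux; the helper name pin_to_current_cpu is hypothetical, and whether pinning actually avoids the slowdown described here is untested.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* for sched_getcpu() and the CPU_* macros (glibc) */
#endif
#include <sched.h>

/* Hypothetical helper: pin the calling thread to whichever CPU it is
   currently running on. Returns that CPU number, or -1 on failure. */
int pin_to_current_cpu(void)
{
    int cpu = sched_getcpu();
    if (cpu < 0)
        return -1;

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* A pid of 0 means "the calling thread". */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;

    return cpu;
}
```

Calling pin_to_current_cpu() before the malloc() and fread() would keep the scheduler from moving the process to another CPU. It only restricts scheduling; it does not by itself control which memory node the buffer's pages land on.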


Thanks,

David Mathog
[email protected]
Manager, Sequence Analysis Facility, Biology Division, Caltech
