This big Dell (PowerEdge T620/03GCP, 48 CPUs, >500 GB RAM) keeps throwing me curve balls.

On the RAID file system there are a bunch of files of about 17453170224 bytes each. (They hold slightly different numbers of fixed-length records.) At one level these bytes move around very quickly; this takes only 3 seconds:

dd if=KTEMP1 of=/dev/null bs=8192

(5.8 GB/s), which means the file must already be in cache. Nothing else is running on this system. However, when a program that uses this code (where len_file is again 17453170224)

   buffer = malloc(len_file);
   (void) posix_fadvise(fileno(fin), 0, 0, POSIX_FADV_SEQUENTIAL);
   (void) posix_madvise(buffer, len_file, POSIX_MADV_SEQUENTIAL);
   rlen = fread(buffer, 1, len_file, fin);

is run, the fread() takes at least 30 seconds, sometimes longer, to complete. The thing is, "top" shows this (sorry about the wrap):

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
22501 mathog    20   0 16.3g  13g  520 R 71.3  2.6   0:44.86 binorder
   99 root      RT   0     0    0    0 S 16.6  0.0   0:08.75 migration/24
    3 root      RT   0     0    0    0 S 12.3  0.0   0:24.91 migration/0

What happens is that RES quickly jumps up to about half of VIRT and then the two migration processes start up, at which point it crawls. The numbers after "migration" vary. dd doesn't run long enough to trigger whatever this migration business is. If my test program is run a couple of times in a row, sometimes it completes the read in about 8 seconds. When that happens the migration processes do not appear.

Through all of this iostat and iotop do not show any IO at all, presumably because it is all going between memory and file cache, with none of the read being straight from the RAID.
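For anyone wanting to reproduce the timing, here is a minimal sketch of how the fread() alone could be timed. The helper name timed_fread is hypothetical and not part of the test program; it assumes a POSIX system with clock_gettime().

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() */
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: fread() len bytes from fin into buffer and
   return the elapsed wall-clock time in seconds. The number of bytes
   actually read is stored in *rlen. */
double timed_fread(FILE *fin, void *buffer, size_t len, size_t *rlen)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    *rlen = fread(buffer, 1, len, fin);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

In the snippet above this would replace the bare fread() call, e.g. secs = timed_fread(fin, buffer, len_file, &rlen); so that malloc() and the advise calls are kept out of the measurement.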

Anyway, using 30s as a nice round number, that works out to about 582 MB/s to move this data from one section of memory to another, which is pretty poor since the stream benchmark shows:

Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:            5737.4     0.027951     0.027887     0.028254
Scale:           6273.8     0.025557     0.025503     0.025686
Add:             7632.6     0.031513     0.031444     0.031657
Triad:           8948.2     0.026896     0.026821     0.027126

all of which are roughly 10x faster or more. Note that the dd time is consistent with stream's copy benchmark.
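The arithmetic behind those rates, as a small sketch. The function name is made up for illustration, and MB is taken as 10^6 bytes, which is assumed to match the units stream reports.

```c
/* Hypothetical helper: average rate in MB/s (1 MB = 10^6 bytes)
   for moving `bytes` bytes in `seconds` seconds. */
double rate_mb_per_s(unsigned long long bytes, double seconds)
{
    return (double)bytes / seconds / 1e6;
}
```

With the file size above, rate_mb_per_s(17453170224ULL, 30.0) comes to about 582 MB/s for the slow fread() case, and rate_mb_per_s(17453170224ULL, 3.0) to about 5818 MB/s for dd.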

Can anybody shed some light on this behavior? In particular, why does the OS feel the need to "migrate" something when one of these huge reads is running? Mostly I want to know how to make it behave, leaving the process/memory attached to one CPU (but not a particular CPU, just wherever it happens to put it) and not shuffle the data through what seems to be a 1/10X speed memory pathway. Also, is there really a 1/10X memory speed pathway on this big box, or is it just that the migration, whatever that is doing, has a lot of overhead?
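To make the "one CPU, but not a particular CPU" idea concrete, here is an untested sketch using the Linux-specific sched_getcpu() and sched_setaffinity() calls. The helper name pin_to_current_cpu is hypothetical. It only restricts where the scheduler may run the thread; whether that has any effect on the migration threads, or on where the file cache pages live, is exactly what is not known here.

```c
#define _GNU_SOURCE  /* for sched_getcpu(), sched_setaffinity(), CPU_* */
#include <sched.h>

/* Hypothetical helper: pin the calling thread to whichever CPU it
   happens to be running on right now. Returns that CPU number, or
   -1 on failure. */
int pin_to_current_cpu(void)
{
    cpu_set_t set;
    int cpu = sched_getcpu();

    if (cpu < 0)
        return -1;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);

    /* A pid of 0 means the calling thread. */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)
        return -1;

    return cpu;
}
```

The intent would be to call this once before the malloc() and fread(), so that the buffer is touched from a single CPU for the whole read.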

Thanks,

David Mathog
[email protected]
Manager, Sequence Analysis Facility, Biology Division, Caltech
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
