Dell PowerEdge T630, PERC H730P, single 11 TB RAID5 array. Xeon E5-2650 CPUs with 40 total threads. 512 GB RAM. CentOS 6.9. Kernel 2.6.32-696.20.1.el6.x86_64. (This machine is basically a small Beowulf in a box.)

Sometimes, for no reason that I can discern, an IO operation on this machine will stall. Things that should take seconds will run for minutes, or at least until I get tired of waiting and kill them. Here is today's example:

  gunzip -c largeFile.gz > largeFile

producing a 24 GB file. One job has been running under "nice" on 40 threads (which is all of them) for a few hours, using only 30 GB of RAM. If no other CPU-intensive jobs start, "top" shows it at 4700-3800. That job is slowly reading largeFile sequentially.

About two hours after largeFile was created this was run:

  wc -l largeFile

and it just sat there for 10 minutes. top showed 100% CPU for the "wc" process. There was nothing else using a significant amount of CPU time, just the one big job and "wc". Killed the wc process and instead did:

  dd if=largeFile bs=8192 | wc -l

and it completed in about 20 seconds.  After that

  wc -l largeFile

also completed, and in only 6.5s.

As far as I can tell largeFile should have been in cache the whole time. Nothing big enough to force it out ran between when it was created and when the wc started. "iostat 1" shows negligible disk activity, just the occasional reads and writes from the long running job, which works by sucking in a chunk of the file, calculating for a while, then emitting a chunk of results to an output file (which is only 320 MB). Using "dd" somehow kicked the system out of this state, forcing largeFile back into cache if it wasn't already there.
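
One rough way to sanity check the "should have been in cache" assumption is to look at the page cache counters in /proc/meminfo. This is only a sketch: the helper name meminfo_kb is made up for illustration, and a large Cached value only shows that there is room for a 24 GB file in cache, not that largeFile itself is resident.

```shell
# meminfo_kb FIELD [FILE]: print the value (in kB) of one /proc/meminfo field.
# meminfo_kb is a hypothetical helper name, not a standard tool.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Show the counters most relevant to cached and pending file data.
if [ -r /proc/meminfo ]; then
    for field in Cached Dirty Writeback; do
        printf '%-10s %s kB\n' "$field" "$(meminfo_kb "$field")"
    done
fi
```

Running this just before and just after a stalled "wc" would show whether Cached dropped or Dirty/Writeback grew in between.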

There are no warnings or errors in dmesg or /var/log/messages.

Checked the console yesterday and there are no error messages on the console display.

The smartctl status of the disks (SAS) the last time it was checked was:
trombone   Mon Feb 12 10:20:22 PST 2018
  SMART status:           P    P    P    P
  Defect list:            0    1    0    2
  Non-medium errors:      1    7   22    3
  Corrected write:        6    1    1    0
  Corrected read:         0    0    0    0
  Uncorrected write:      0    0    0    0
  Uncorrected read:       0    0    0    0
  Age:                    16630 16630 16630 16630

and those values are unchanged after this event. (Another PowerEdge T630 with SAS disks also has the occasional non-medium error and corrected write.)

A script which dumps pretty much all of the information available from the RAID using "megacli" is run periodically. The only differences between a run after the "dd" and one from weeks ago are the time stamps, disk temperatures and battery charge levels (by a few percent).

We have three systems that are fairly similar to this one, but only this one has this odd behavior. These IO stalls have been seen on it before. There was a similar issue a couple of days ago, so the system was rebooted then. Apparently that made no difference.

Examined every value in /proc/sys/vm, and this system and another one differ only in the max_map_count value. The problem system has 262144 and the other has 65530. Doesn't seem likely to be the issue.
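
For anyone who wants to repeat that comparison, here is a minimal sketch. It assumes the settings in question are the files under /proc/sys/vm; the function name dump_settings and the output file names are made up for illustration.

```shell
# dump_settings DIR: print "name = value" for every readable file in DIR, sorted.
# dump_settings is a hypothetical helper name.
dump_settings() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        # Write-only entries cannot be read; skip them quietly.
        value=$(cat "$f" 2>/dev/null) || continue
        printf '%s = %s\n' "${f##*/}" "$value"
    done | sort
}

# On each host:
#   dump_settings /proc/sys/vm > vm.$(hostname -s).txt
# Then, with both files on one host (hostA and hostB are placeholders):
#   diff vm.hostA.txt vm.hostB.txt
```

The same function can be pointed at other directories of one-value-per-file settings, such as the hugepage directory under /sys/kernel/mm.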

Checked the hugepage settings and found a difference there. The two systems that don't do this have /sys/kernel/mm/redhat_transparent_hugepage/defrag set to:

always madvise [never]

whereas the system with the issue has:

[always] madvise never

I did not see any other jobs using up CPU time when this was going on, but perhaps the defrag processes sometimes run in a mode where they don't rise much in "top" yet bog down the IO. In any case, set the problem system to match the other two.
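
A small sketch of how the active setting can be read back, since the kernel marks the choice in effect with square brackets. The function name thp_active is made up for illustration; the path is the one given above for this CentOS 6 kernel.

```shell
# thp_active LINE: print the bracketed (active) word from a settings line.
# thp_active is a hypothetical helper name.
thp_active() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Path as it appears on this CentOS 6 kernel.
THP_DEFRAG=/sys/kernel/mm/redhat_transparent_hugepage/defrag

if [ -r "$THP_DEFRAG" ]; then
    echo "defrag is currently: $(thp_active "$(cat "$THP_DEFRAG")")"
    # To change it (as root; the change does not survive a reboot):
    #   echo never > "$THP_DEFRAG"
fi
```

For example, thp_active "[always] madvise never" prints "always", which is what the problem system reported before the change.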

Does this sound like a reasonable cause for the slowdown, or might there be something else going on? (And if so, what?)


David Mathog
Manager, Sequence Analysis Facility, Biology Division, Caltech
Beowulf mailing list, sponsored by Penguin Computing