Hi Stumped,

Is this MPI job on one machine? Multiple nodes? Are the tiny 8K writes to the 
same file or different ones?


I hate admitting this but I've found something that's got me stumped.

We have a user running an MPI job on the system. Each rank opens up several 
output files to which it writes ASCII debug information. The net result across 
several hundred ranks is an absolute smattering of teeny tiny I/O requests to
the underlying disks, which they don't appreciate. Performance plummets. The I/O
requests are 30 to 80 bytes in size. What I don't understand is why these write
requests aren't getting batched up into larger write requests to the underlying
disks.

If I do something like "dd if=/dev/zero of=foo bs=8k" on a node I see that the
nasty unaligned 8k I/O requests are batched up into nice 1M I/O requests before
they hit the NSD.
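
For anyone who wants to put the two cases side by side on a single node, here
is a minimal sketch of what one rank's output pattern might look like. It is a
Python stand-in, not the user's application (whose language and I/O library are
not known here); the function name, the file names and the parameters are all
made up for illustration. It only controls user-space buffering: buffering=1
pushes every line to the kernel as its own small write, while a large value lets
the runtime collect lines first.

```python
import os
import random


def write_debug_lines(directory, n_files=4, n_lines=1000, buffering=1, seed=0):
    """Emulate one rank appending short ASCII records to several files.

    Hypothetical helper: records are 30 to 80 bytes long (newline included)
    and are spread round-robin over n_files files.
    buffering=1 means line-buffered text output; a larger value sets the
    size in bytes of the user-space buffer.
    Returns the total number of bytes written.
    """
    rng = random.Random(seed)
    paths = [os.path.join(directory, f"debug_{i}.txt") for i in range(n_files)]
    # newline="\n" keeps one byte per line ending on every platform.
    files = [open(p, "w", buffering=buffering, newline="\n") for p in paths]
    total = 0
    try:
        for line_no in range(n_lines):
            size = rng.randint(30, 80)
            # Pad with ASCII filler so the record is exactly `size` bytes.
            line = f"{line_no:08d} ".ljust(size - 1, "x") + "\n"
            files[line_no % n_files].write(line)
            total += len(line)
    finally:
        for f in files:
            f.close()
    return total
```

Running write_debug_lines() against a directory on the file system in question,
once with buffering=1 and once with something like buffering=1 << 20, while
watching the NSD I/O sizes the same way as for the dd run, would at least show
whether the small requests can be reproduced without MPI in the picture.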

As best I can tell the application isn't doing any fsync's and isn't doing 
direct io to these files.

Can anyone explain why seemingly very similar I/O workloads appear to result in
well-formed NSD I/O in one case and awful I/O in another?



