The POSIX standard pretty clearly allows short writes to occur (a successful call to write() returning fewer bytes than requested), but it's not something you see very often, and I don't think many users or applications expect it to occur when writing to disk-based files. We are seeing it fairly regularly and just wanted to confirm that we (or rather our users) should expect this behaviour from Lustre.
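For concreteness, the usual way a caller copes with a short write is to loop until the whole buffer has gone out. The following is only a minimal sketch of that pattern, not code from any particular application: the helper name write_all is made up for illustration, and retrying on EINTR is an assumption about what the caller wants.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (name is illustrative): keep calling write() until
 * all `count` bytes have been written or a real error occurs.
 * Returns 0 on success, -1 on error with errno as set by write(). */
int write_all(int fd, const void *buf, size_t count)
{
    const char *p = buf;
    size_t remaining = count;

    while (remaining > 0) {
        ssize_t n = write(fd, p, remaining);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data was written: retry */
            return -1;      /* genuine error: report it to the caller */
        }
        /* n may be less than `remaining` (a short write); advance and loop */
        p += n;
        remaining -= (size_t)n;
    }
    return 0;
}
```

A caller would use write_all(fd, buf, len) in place of a bare write(fd, buf, len) and treat only a -1 return as a failure.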
We are seeing the issue with the infamous Gaussian quantum chemistry code, which spends literally days constantly writing to and reading from scratch files in roughly 1GB chunks as part of its out-of-core solvers. We manage jobs using simple SIGSTOP/SIGCONT-based suspend/resume, and occasionally a job will flag a short write immediately after a SIGCONT. The application incorrectly treats this as an error and aborts. Adding code to complete the write appears to fix the problem (as you'd hope). Now we are at the stage of "debating" with the application developers whether it's their problem or Lustre's.

Is this considered normal Lustre behaviour? This is with Lustre 1.8.3 clients on a 2.6.27.46 kernel.

Thanks,
David
