Hello, I have a troubling issue with random file corruption on both Lustre 1.8.6 (internal Cray Lustre) and Lustre 2.1 (Sonexion, produced by Xyratex).
Every so often, our users come across files that are either zero bytes in size or corrupted. The zero-size files are usually ASCII files (normally created serially with simple cat and awk statements), while the corrupted files are weather data (GRIB) files that are most often truncated during an untar operation. Other times, the files have blocks filled with zeroes in the middle of the file.

The real kicker is that we cannot reproduce the problem reliably enough to troubleshoot it. I managed to trigger a file truncation after 1,500 iterations of untarring the same tar file, but since then I haven't been able to reproduce it, even after 30,000 iterations. When it happens, there are no Lustre-related errors in the logs, and nothing is dumped into /tmp.

Has anyone come across this before? I've searched Google for weeks but have found only a few bugs that look similar, and those are usually related to netCDF and parallel I/O, whereas our corruption usually occurs with serial I/O. What log settings would you suggest for capturing this phantom while it is happening?

Thanks in advance,
Jason
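
For reference, the repeated-untar test described above could be scripted roughly as follows. This is only a minimal sketch, not the exact script that was used: the function names (member_digests, untar_and_verify, run), the scratch directory, and the choice of SHA-256 are all assumptions made for illustration. It computes the expected size and checksum of every regular file directly from the archive, extracts the archive into a fresh directory on each iteration, and stops at the first iteration where an extracted file is missing, has the wrong size (for example truncated or zero-length), or has the right size but different contents (for example a zero-filled block).

```python
import hashlib
import os
import shutil
import tarfile
import tempfile


def member_digests(tar_path):
    """Return {member name: (size, sha256 hex)} read from the archive itself."""
    digests = {}
    with tarfile.open(tar_path) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue  # skip directories, links, devices
            data = tar.extractfile(member).read()
            digests[member.name] = (member.size, hashlib.sha256(data).hexdigest())
    return digests


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file on disk, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def untar_and_verify(tar_path, scratch_dir, expected):
    """Extract once into a fresh directory under scratch_dir.

    Returns a list of (member name, problem description) tuples;
    an empty list means every file matched the archive.
    """
    problems = []
    workdir = tempfile.mkdtemp(dir=scratch_dir)
    try:
        with tarfile.open(tar_path) as tar:
            tar.extractall(workdir)
        for name, (size, digest) in expected.items():
            path = os.path.join(workdir, name)
            if not os.path.exists(path):
                problems.append((name, "missing"))
                continue
            actual_size = os.path.getsize(path)
            if actual_size != size:
                # Covers truncated and zero-length files.
                problems.append((name, f"size {actual_size} != {size}"))
            elif file_digest(path) != digest:
                # Same size, different bytes (e.g. a zero-filled block).
                problems.append((name, "checksum mismatch"))
    finally:
        shutil.rmtree(workdir, ignore_errors=True)
    return problems


def run(tar_path, scratch_dir, iterations):
    """Repeat the extraction; return (iteration, problems) at the first failure,
    or (None, []) if all iterations were clean."""
    expected = member_digests(tar_path)
    for i in range(1, iterations + 1):
        problems = untar_and_verify(tar_path, scratch_dir, expected)
        if problems:
            return i, problems
    return None, []
```

A call such as run("/path/to/test.tar", "/path/to/lustre/scratch", 30000) would exercise it; both paths are placeholders. One caveat: reading the files back immediately on the same client may be served from that client's cache rather than from the servers, so a check like this may miss corruption that is only visible from another node or after the cache has been dropped.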
