On Wed, 14 Feb 2018 06:20:32 -0800, John Hanks said: > # ls -aln /srv/gsfs0/projects/pipetest.tmp.txt $HOME/pipetest.tmp.txt > -rw-r--r-- 1 39073 3953 530721 Feb 14 06:10 /home/griznog/pipetest.tmp.txt > -rw-r--r-- 1 39073 3001 530721 Feb 14 06:10 > /srv/gsfs0/projects/pipetest.tmp.txt > > We can "fix" the user case that exposed this by not using a temp file or > inserting a sleep, but I'd still like to know why GPFS is behaving this way > and make it stop.
May be related to replication, or other behind-the-scenes behavior. Consider this example - 220.127.116.11, data and metadata replication both set to 2, 2 sites 95 cable miles apart, each is 3 Dell servers with a full fiberchannel mesh to 3 Dell MD34something arrays. % dd if=/dev/zero bs=1k count=4096 of=sync.test; ls -ls sync.test; sleep 5; ls -ls sync.test; sleep 5; ls -ls sync.test 4096+0 records in 4096+0 records out 4194304 bytes (4.2 MB) copied, 0.0342852 s, 122 MB/s 2048 -rw-r--r-- 1 root root 4194304 Feb 14 09:35 sync.test 8192 -rw-r--r-- 1 root root 4194304 Feb 14 09:35 sync.test 8192 -rw-r--r-- 1 root root 4194304 Feb 14 09:35 sync.test Notice that the first /bin/ls shouldn't be starting until after the dd has completed - at which point it's only allocated half the blocks needed to hold the 4M of data at one site. 5 seconds later, it's allocated the blocks at both sites and thus shows the full 8M needed for 2 copies. I've also seen (but haven't replicated it as I write this) a small file (4-8K or so) showing first one full-sized block, then a second full-sized block, and then dropping back to what's needed for 2 1/32nd fragments. That had me scratching my head Having said that, that's all metadata fun and games, while your case appears to have some problems with data integrity (which is a whole lot scarier). It would be *really* nice if we understood the problem here. The scariest part is: > The first grep | wc -l returns 1, because grep outputs "Binary file /path/to/ > gpfs/mount/test matches" which seems to be implying that we're failing on semantic consistency. Basically, your 'cat' command is completing and closing the file, but then a temporally later open of the same find is reading something other that only the just-written data. My first guess is that it's a race condition similar to the following: The cat command is causing a write on one NSD server, and the first grep results in a read from a *different* NSD server, returning the data that *used* to be in the block because the read actually happens before the first NSD server actually completes the write. It may be interesting to replace the grep's with pairs of 'ls -ls / dd' commands to grab the raw data and its size, and check the following: 1) does the size (both blocks allocated and logical length) reported by ls match the amount of data actually read by the dd? 2) Is the file length as actually read equal to the written length, or does it overshoot and read all the way to the next block boundary? 3) If the length is correct, what's wrong with the data that's telling grep that it's a binary file? ( od -cx is your friend here). 4) If it overshoots, is the remainder all-zeros (good) or does it return semi-random "what used to be there" data (bad, due to data exposure issues)? (It's certainly not the most perplexing data consistency issue I've hit in 4 decades - the winner *has* to be a intermittent data read corruption on a GPFS 3.5 cluster that had us, IBM, SGI, DDN, and at least one vendor of networking gear all chasing our tails for 18 months before we finally tracked it down. :)
Description: PGP signature
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss