Not going to say much about DDN setups, but the first thing that makes my
eyes go a bit blurry is


when you mention your whole cluster is already on 4.2.3
Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations
Luis Bolinches
Consultant IT Specialist
Mobile Phone: +358503112585

"If you always give you will always have" --  Anonymous

From:   John Hanks <griz...@gmail.com>
To:     gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Date:   14/02/2018 17:22
Subject:        Re: [gpfsug-discuss] Odd behavior with cat followed by 
Sent by:        gpfsug-discuss-boun...@spectrumscale.org

Hi Valdis,

I tried with the grep replaced with 'ls -ls' and 'md5sum'. I don't think
this is a data integrity issue, thankfully:

$ ./pipetestls.sh 
256 -rw-r--r-- 1 39073 3001 530721 Feb 14 07:16 
0 -rw-r--r-- 1 39073 3953 530721 Feb 14 07:16 

$ ./pipetestmd5.sh 
15cb81a85c9e450bdac8230309453a0a  /srv/gsfs0/projects/pipetest.tmp.txt
15cb81a85c9e450bdac8230309453a0a  /home/griznog/pipetest.tmp.txt

And replacing grep with 'file' even properly sees the files as ASCII:
$ ./pipetestfile.sh 
/srv/gsfs0/projects/pipetest.tmp.txt: ASCII text, with very long lines
/home/griznog/pipetest.tmp.txt: ASCII text, with very long lines

I'll poke a little harder at grep next and see what the difference in 
strace of each reveals.



On Wed, Feb 14, 2018 at 7:08 AM, <valdis.kletni...@vt.edu> wrote:
On Wed, 14 Feb 2018 06:20:32 -0800, John Hanks said:

> #  ls -aln /srv/gsfs0/projects/pipetest.tmp.txt $HOME/pipetest.tmp.txt
> -rw-r--r-- 1 39073 3953 530721 Feb 14 06:10 
> -rw-r--r-- 1 39073 3001 530721 Feb 14 06:10
> /srv/gsfs0/projects/pipetest.tmp.txt
> We can "fix" the user case that exposed this by not using a temp file or
> inserting a sleep, but I'd still like to know why GPFS is behaving this
> way and make it stop.

May be related to replication, or other behind-the-scenes behavior.

Consider this example -, data and metadata replication both
set to 2, 2 sites 95 cable miles apart, each is 3 Dell servers with a full
fiberchannel mesh to 3 Dell MD34something arrays.

% dd if=/dev/zero bs=1k count=4096 of=sync.test; ls -ls sync.test; sleep 5; ls -ls sync.test; sleep 5; ls -ls sync.test
4096+0 records in
4096+0 records out
4194304 bytes (4.2 MB) copied, 0.0342852 s, 122 MB/s
2048 -rw-r--r-- 1 root root 4194304 Feb 14 09:35 sync.test
8192 -rw-r--r-- 1 root root 4194304 Feb 14 09:35 sync.test
8192 -rw-r--r-- 1 root root 4194304 Feb 14 09:35 sync.test

Notice that the first /bin/ls shouldn't be starting until after the dd has
completed - at which point it's only allocated half the blocks needed to
hold the 4M of data at one site.  5 seconds later, it's allocated the
blocks at both sites and thus shows the full 8M needed for 2 copies.
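
A minimal sketch of the same observation, for anyone who wants to repeat it
without eyeballing ls output. It samples the allocated bytes and the logical
size of a file a few times in a row. The function names alloc_report and
watch_alloc are made up for illustration, and it assumes GNU coreutils stat
(the -c format with %b, %B and %s).

```shell
#!/bin/sh
# Hypothetical helpers; assumes GNU coreutils stat.

# Print allocated bytes and logical size for one file.
alloc_report() {
    # %b = blocks allocated, %B = bytes per reported block, %s = logical size
    set -- $(stat -c '%b %B %s' "$1")
    echo "allocated=$(( $1 * $2 )) logical=$3"
}

# Sample a file COUNT times (default 3), INTERVAL seconds apart (default 5).
watch_alloc() {
    file=$1
    count=${2:-3}
    interval=${3:-5}
    i=0
    while [ "$i" -lt "$count" ]; do
        alloc_report "$file"
        i=$(( i + 1 ))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}
```

For example, running "dd if=/dev/zero bs=1k count=4096 of=sync.test" and then
"watch_alloc sync.test 3 5" would print three samples five seconds apart, so
a change in the allocated figure between samples is easy to spot.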

I've also seen (but haven't replicated it as I write this) a small file 
or so) showing first one full-sized block, then a second full-sized block, 
then dropping back to what's needed for 2 1/32nd fragments.  That had me
scratching my head.

Having said that, that's all metadata fun and games, while your case
appears to have some problems with data integrity (which is a whole lot
scarier).  It would be *really* nice if we understood the problem here.

The scariest part is:

> The first grep | wc -l returns 1, because grep outputs  "Binary file 
> gpfs/mount/test matches"

which seems to be implying that we're failing on semantic consistency.
Basically, your 'cat' command is completing and closing the file, but then a
temporally later open of the same file is reading something other than only
the just-written data.  My first guess is that it's a race condition similar
to the following: The cat command is causing a write on one NSD server, and
the grep results in a read from a *different* NSD server, returning the data
that *used* to be in the block because the read actually happens before the
NSD server actually completes the write.
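
One rough way to probe for that kind of race is to write a known payload,
reopen the file immediately, and compare what comes back. The sketch below
does that in a loop and reports how many rounds mismatched. The function name
read_after_write_check and the temp file naming are hypothetical, and it only
exercises a single client node, so it would not by itself show which server
a read was sent to.

```shell
#!/bin/sh
# Hypothetical helper: count read-after-close mismatches in directory DIR.
read_after_write_check() {
    dir=$1
    rounds=${2:-100}
    src=$(mktemp) || return 2
    seq 1 20000 > "$src"              # known ASCII payload
    bad=0
    i=0
    while [ "$i" -lt "$rounds" ]; do
        tmp="$dir/raw-check.$$.$i"
        cat "$src" > "$tmp"           # writer finishes and closes the file
        # Reopen right away and compare byte for byte with the source.
        cmp -s "$src" "$tmp" || bad=$(( bad + 1 ))
        rm -f "$tmp"
        i=$(( i + 1 ))
    done
    rm -f "$src"
    echo "$bad"                       # number of mismatching rounds
}
```

Pointing it at a directory on the file system under suspicion, for example
"read_after_write_check /srv/gsfs0/projects 1000", and getting anything other
than 0 would support the stale-read theory. A result of 0 does not rule it
out.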

It may be interesting to replace the greps with pairs of 'ls -ls / dd'
commands to grab the raw data and its size, and check the following:

1) Does the size (both blocks allocated and logical length) reported by
ls match the amount of data actually read by the dd?

2) Is the file length as actually read equal to the written length, or
does it overshoot and read all the way to the next block boundary?

3) If the length is correct, what's wrong with the data that's telling
grep that it's a binary file?  (od -cx is your friend here.)

4) If it overshoots, is the remainder all-zeros (good) or does it return
"what used to be there" data (bad, due to data exposure issues)?

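The first three checks can be roughed out in one helper. This is only a
sketch: the name inspect_read is made up, it assumes GNU coreutils stat, and
it counts NUL bytes simply as one plausible thing to look for when grep
decides a file is binary. It reads the file once with dd into a scratch copy
so that the byte count and the NUL count describe the same read.

```shell
#!/bin/sh
# Hypothetical helper; assumes GNU coreutils stat.
# Compare the size stat reports with what dd actually returns, and count
# NUL bytes in the data that was read.
inspect_read() {
    file=$1
    copy=$(mktemp) || return 2
    # %s = logical size, %b = blocks allocated, %B = bytes per reported block
    set -- $(stat -c '%s %b %B' "$file")
    logical=$1
    allocated=$(( $2 * $3 ))
    dd if="$file" of="$copy" bs=64k 2>/dev/null   # single read of the file
    read_bytes=$(wc -c < "$copy")
    nul_bytes=$(tr -dc '\000' < "$copy" | wc -c)  # keep only NULs, count them
    rm -f "$copy"
    echo "logical=$logical allocated=$allocated read=$read_bytes nul=$nul_bytes"
}
```

If read differs from logical, that points at checks 1 and 2. If they agree
but nul is non-zero, 'od -cx' on the file shows where the unexpected bytes
sit.
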
(It's certainly not the most perplexing data consistency issue I've hit in
4 decades - the winner *has* to be an intermittent data read corruption on
a GPFS 3.5 cluster that had us, IBM, SGI, DDN, and at least one vendor of
networking gear all chasing our tails for 18 months before we finally
tracked it down. :)

gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org


Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
Oy IBM Finland Ab
PL 265, 00101 Helsinki, Finland
Business ID, Y-tunnus: 0195876-3 
Registered in Finland
