hi sven,

> ok, you can't be any newer than that. i just wonder why you have 512b
> inodes if this is a new system ?
because we rsynced 100M files to it ;) it's supposed to replace another
system.
> are these raw disks in this setup or raid controllers ?
raid (DDP on MD3460)

> what's the disk sector size
euhm, you mean the luns?

for metadata disks (SSD in raid 1):
> # parted /dev/mapper/f1v01e0g0_Dm01o0
> GNU Parted 3.1
> Using /dev/mapper/f1v01e0g0_Dm01o0
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) p
> Model: Linux device-mapper (multipath) (dm)
> Disk /dev/mapper/f1v01e0g0_Dm01o0: 219GB
> Sector size (logical/physical): 512B/512B
> Partition Table: gpt
> Disk Flags:
>
> Number  Start   End    Size   File system  Name          Flags
>  1      24.6kB  219GB  219GB               GPFS: hidden

for data disks (DDP):
> [root@nsd01 ~]# parted /dev/mapper/f1v01e0p0_S17o0
> GNU Parted 3.1
> Using /dev/mapper/f1v01e0p0_S17o0
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) p
> Model: Linux device-mapper (multipath) (dm)
> Disk /dev/mapper/f1v01e0p0_S17o0: 35.2TB
> Sector size (logical/physical): 512B/4096B
> Partition Table: gpt
> Disk Flags:
>
> Number  Start   End     Size    File system  Name          Flags
>  1      24.6kB  35.2TB  35.2TB               GPFS: hidden
>
> (parted) q
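fwiw, the sector sizes can also be read straight from the block layer as a
quick cross-check against what parted reports (just a sketch, using the same
multipath device as above; blockdev is part of util-linux):

blockdev --getss /dev/mapper/f1v01e0p0_S17o0     # logical sector size
blockdev --getpbsz /dev/mapper/f1v01e0p0_S17o0   # physical sector size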
> and how was the filesystem created (mmlsfs FSNAME would show
> answer to the last question)
> # mmlsfs somefilesystem
> flag                         value                    description
> ---------------------------- ------------------------ -----------------------------------
>  -f                          16384                    Minimum fragment size in bytes (system pool)
>                              262144                   Minimum fragment size in bytes (other pools)
>  -i                          4096                     Inode size in bytes
>  -I                          32768                    Indirect block size in bytes
>  -m                          2                        Default number of metadata replicas
>  -M                          2                        Maximum number of metadata replicas
>  -r                          1                        Default number of data replicas
>  -R                          2                        Maximum number of data replicas
>  -j                          scatter                  Block allocation type
>  -D                          nfs4                     File locking semantics in effect
>  -k                          all                      ACL semantics in effect
>  -n                          850                      Estimated number of nodes that will mount file system
>  -B                          524288                   Block size (system pool)
>                              8388608                  Block size (other pools)
>  -Q                          user;group;fileset       Quotas accounting enabled
>                              user;group;fileset       Quotas enforced
>                              none                     Default quotas enabled
>  --perfileset-quota          Yes                      Per-fileset quota enforcement
>  --filesetdf                 Yes                      Fileset df enabled?
>  -V                          17.00 (4.2.3.0)          File system version
>  --create-time               Wed May 31 12:54:00 2017 File system creation time
>  -z                          No                       Is DMAPI enabled?
>  -L                          4194304                  Logfile size
>  -E                          No                       Exact mtime mount option
>  -S                          No                       Suppress atime mount option
>  -K                          whenpossible             Strict replica allocation option
>  --fastea                    Yes                      Fast external attributes enabled?
>  --encryption                No                       Encryption enabled?
>  --inode-limit               313524224                Maximum number of inodes in all inode spaces
>  --log-replicas              0                        Number of log replicas
>  --is4KAligned               Yes                      is4KAligned?
>  --rapid-repair              Yes                      rapidRepair enabled?
>  --write-cache-threshold     0                        HAWC Threshold (max 65536)
>  --subblocks-per-full-block  32                       Number of subblocks per full block
>  -P                          system;MD3260            Disk storage pools in file system
>  -d                          f0v00e0g0_Sm00o0;f0v00e0p0_S00o0;f1v01e0g0_Sm01o0;f1v01e0p0_S01o0;f0v02e0g0_Sm02o0;f0v02e0p0_S02o0;f1v03e0g0_Sm03o0;f1v03e0p0_S03o0;f0v04e0g0_Sm04o0;f0v04e0p0_S04o0;
>  -d                          f1v05e0g0_Sm05o0;f1v05e0p0_S05o0;f0v06e0g0_Sm06o0;f0v06e0p0_S06o0;f1v07e0g0_Sm07o0;f1v07e0p0_S07o0;f0v00e0g0_Sm08o1;f0v00e0p0_S08o1;f1v01e0g0_Sm09o1;f1v01e0p0_S09o1;
>  -d                          f0v02e0g0_Sm10o1;f0v02e0p0_S10o1;f1v03e0g0_Sm11o1;f1v03e0p0_S11o1;f0v04e0g0_Sm12o1;f0v04e0p0_S12o1;f1v05e0g0_Sm13o1;f1v05e0p0_S13o1;f0v06e0g0_Sm14o1;f0v06e0p0_S14o1;
>  -d                          f1v07e0g0_Sm15o1;f1v07e0p0_S15o1;f0v00e0p0_S16o0;f1v01e0p0_S17o0;f0v02e0p0_S18o0;f1v03e0p0_S19o0;f0v04e0p0_S20o0;f1v05e0p0_S21o0;f0v06e0p0_S22o0;f1v07e0p0_S23o0;
>  -d                          f0v00e0p0_S24o1;f1v01e0p0_S25o1;f0v02e0p0_S26o1;f1v03e0p0_S27o1;f0v04e0p0_S28o1;f1v05e0p0_S29o1;f0v06e0p0_S30o1;f1v07e0p0_S31o1  Disks in file system
>  -A                          no                       Automatic mount option
>  -o                          none                     Additional mount options
>  -T                          /scratch                 Default mount point
>  --mount-priority            0

> on the tsdbfs i am not sure if it gave wrong results, but it would be worth
> a test to see what's actually on the disk .
ok. i'll try this tomorrow.
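a rough sketch of what i plan to try, based on the addresses from the
"tsdbfs comp" run quoted below (assuming 7: and 25: are NSD ids, that the
sector numbers are 512-byte offsets within the NSD, and that mmlsnsd -m
shows the local device path on the server that owns each disk; the device
names and /tmp paths are placeholders, and the NSD data-start offset may
still have to be added, so treat it as an approximation only):

# on the nsd server that can see nsd 7 (mmlsnsd -m shows the mapping)
dd if=/dev/mapper/<device_for_nsd_7> of=/tmp/replica_a bs=512 skip=5137408 count=1024
# on the nsd server that can see nsd 25
dd if=/dev/mapper/<device_for_nsd_25> of=/tmp/replica_b bs=512 skip=221785088 count=1024
# copy both files to one node and compare
cmp /tmp/replica_a /tmp/replica_b && echo "replicas identical on disk"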
> you are correct that GNR extends this to the disk, but the network part is
> covered by the nsdchecksums you turned on
> when you enable the not to be named checksum parameter do you actually
> still get an error from fsck ?
hah, no, we don't. mmfsck says the filesystem is clean. we found this odd,
so we already asked ibm support about this but no answer yet.

stijn

>
> sven
>
>
> On Wed, Aug 2, 2017 at 2:14 PM Stijn De Weirdt <[email protected]>
> wrote:
>
>> hi sven,
>>
>>> before i answer the rest of your questions, can you share what version of
>>> GPFS exactly you are on mmfsadm dump version would be best source for
>>> that.
>> it returns
>> Build branch "4.2.3.3 ".
>>
>>> if you have 2 inodes and you know the exact address of where they are
>>> stored on disk one could 'dd' them off the disk and compare if they are
>>> really equal.
>> ok, i can try that later. are you suggesting that the "tsdbfs comp"
>> might give wrong results? because we ran that and got eg
>>
>>> # tsdbfs somefs comp 7:5137408 25:221785088 1024
>>> Comparing 1024 sectors at 7:5137408 = 0x7:4E6400 and 25:221785088 = 0x19:D382C00:
>>> All sectors identical
>>
>>> we only support checksums when you use GNR based systems, they cover
>>> network as well as Disk side for that.
>>> the nsdchecksum code you refer to is the one i mentioned above that's only
>>> supported with GNR at least i am not aware that we ever claimed it to be
>>> supported outside of it, but i can check that.
>> ok, maybe i'm a bit confused. we have a GNR too, but it's not this one,
>> and they are not in the same gpfs cluster.
>>
>> i thought the GNR extended the checksumming to disk, and that it was
>> already there for the network part. thanks for clearing this up. but
>> that is worse than i thought...
>>
>> stijn
>>
>>> sven
>>>
>>> On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt <[email protected]>
>>> wrote:
>>>
>>>> hi sven,
>>>>
>>>> the data is not corrupted. mmfsck compares 2 inodes, says they don't
>>>> match, but checking the data with tsdbfs reveals they are equal.
>>>> (one replica has to be fetched over the network; the nsds cannot access
>>>> all disks)
>>>>
>>>> with some nsdChksum settings we get during this mmfsck a lot of
>>>> "Encountered XYZ checksum errors on network I/O to NSD Client disk"
>>>>
>>>> ibm support says these are hardware issues, but wrt mmfsck false
>>>> positives.
>>>>
>>>> anyway, our current question is: if these are hardware issues, is there
>>>> anything in gpfs client->nsd (on the network side) that would detect
>>>> such errors. ie can we trust the data (and metadata).
>>>> i was under the impression that client to disk is not covered, but i
>>>> assumed that at least client to nsd (the network part) was checksummed.
>>>>
>>>> stijn
>>>>
>>>> On 08/02/2017 09:10 PM, Sven Oehme wrote:
>>>>> ok, i think i understand now, the data was already corrupted. the config
>>>>> change i proposed only prevents a potentially known future on-the-wire
>>>>> corruption, this will not fix something that made it to the disk already.
>>>>>
>>>>> Sven
>>>>>
>>>>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> yes ;)
>>>>>>
>>>>>> the system is in preproduction, so nothing that can't be stopped/started
>>>>>> in a few minutes (current setup has only 4 nsds, and no clients).
>>>>>> mmfsck triggers the errors very early during inode replica compare.
>>>>>>
>>>>>> stijn
>>>>>>
>>>>>> On 08/02/2017 08:47 PM, Sven Oehme wrote:
>>>>>>> How can you reproduce this so quick ?
>>>>>>> Did you restart all daemons after that ?
>>>>>>>
>>>>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> hi sven,
>>>>>>>>
>>>>>>>>> the very first thing you should check is if you have this setting set :
>>>>>>>> maybe the very first thing to check should be the faq/wiki that has this
>>>>>>>> documented?
>>>>>>>>
>>>>>>>>> mmlsconfig envVar
>>>>>>>>>
>>>>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1
>>>>>>>>> MLX5_USE_MUTEX 1
>>>>>>>>>
>>>>>>>>> if that doesn't come back the way above you need to set it :
>>>>>>>>>
>>>>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1
>>>>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1"
>>>>>>>> i just set this (wasn't set before), but problem is still present.
>>>>>>>>
>>>>>>>>> there was a problem in the Mellanox FW in various versions that was never
>>>>>>>>> completely addressed (bugs were found and fixed, but it was never fully
>>>>>>>>> proven to be addressed). the above environment variables turn code on in
>>>>>>>>> the mellanox driver that prevents this potential code path from being used
>>>>>>>>> to begin with.
>>>>>>>>>
>>>>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale
>>>>>>>>> so that even if you don't set these variables the problem can't happen
>>>>>>>>> anymore. until then the only choice you have is the envVar above (which btw
>>>>>>>>> ships as default on all ESS systems).
>>>>>>>>>
>>>>>>>>> you also should be on the latest available Mellanox FW & Drivers as not all
>>>>>>>>> versions even have the code that is activated by the environment variables
>>>>>>>>> above, i think at a minimum you need to be at 3.4 but i don't remember the
>>>>>>>>> exact version.
>>>>>>>>> There had been multiple defects opened around this area, the last one i
>>>>>>>>> remember was :
>>>>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from
>>>>>>>> dell, and the fw is a bit behind. i'm trying to convince dell to make a
>>>>>>>> new one. mellanox used to allow you to make your own, but they don't
>>>>>>>> anymore.
>>>>>>>>
>>>>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on
>>>>>>>>> pthread_spin_lock
>>>>>>>>>
>>>>>>>>> you may ask your mellanox representative if they can get you access to
>>>>>>>>> this defect. while it was found on ESS, means on PPC64 and with ConnectX-3
>>>>>>>>> cards, it's a general issue that affects all cards and on intel as well as
>>>>>>>>> Power.
>>>>>>>> ok, thanks for this. maybe such a reference is enough for dell to update
>>>>>>>> their firmware.
>>>>>>>>
>>>>>>>> stijn
>>>>>>>>
>>>>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> hi all,
>>>>>>>>>>
>>>>>>>>>> is there any documentation wrt data integrity in spectrum scale:
>>>>>>>>>> assuming a crappy network, does gpfs guarantee somehow that data written
>>>>>>>>>> by a client ends up safe in the nsd gpfs daemon; and similarly from the
>>>>>>>>>> nsd gpfs daemon to disk.
>>>>>>>>>>
>>>>>>>>>> and wrt crappy network, what about rdma on crappy network? is it the
>>>>>>>>>> same?
>>>>>>>>>>
>>>>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says it's a
>>>>>>>>>> network issue; and we see no errors anywhere...)
>>>>>>>>>>
>>>>>>>>>> thanks a lot,
>>>>>>>>>>
>>>>>>>>>> stijn
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
