yes ;) the system is in preproduction, so nothing that can't be stopped/started in a few minutes (the current setup has only 4 nsds, and no clients). mmfsck triggers the errors very early, during the inode replica compare.
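(for the record, the reproduction is just a daemon bounce followed by an offline check; a rough sketch, with gpfs0 standing in for our actual device name:

   mmshutdown -a          # stop the gpfs daemons on all nodes
   mmstartup -a           # start them again
   mmumount gpfs0 -a      # mmfsck needs the fs unmounted everywhere
   mmfsck gpfs0 -n -v     # -n: check only, don't repair; -v: verbose

the -n keeps mmfsck in check-only mode, so it just reports the mismatching replicas without touching anything.)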
stijn

On 08/02/2017 08:47 PM, Sven Oehme wrote:
> How can you reproduce this so quickly?
> Did you restart all daemons after that?
>
> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt <[email protected]>
> wrote:
>
>> hi sven,
>>
>>> the very first thing you should check is if you have this setting set:
>> maybe the very first thing to check should be the faq/wiki that has
>> this documented?
>>
>>> mmlsconfig envVar
>>>
>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1
>>> MLX5_USE_MUTEX 1
>>>
>>> if that doesn't come back the way above you need to set it:
>>>
>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1
>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1"
>> i just set this (it wasn't set before), but the problem is still present.
>>
>>> there was a problem in the Mellanox FW in various versions that was
>>> never completely addressed (bugs were found and fixed, but it was
>>> never fully proven to be addressed). the above environment variables
>>> turn code on in the mellanox driver that prevents this potential
>>> code path from being used to begin with.
>>>
>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in
>>> Scale so that even if you don't set these variables the problem
>>> can't happen anymore. until then the only choice you have is the
>>> envVar above (which btw ships as default on all ESS systems).
>>>
>>> you also should be on the latest available Mellanox FW & drivers, as
>>> not all versions even have the code that is activated by the
>>> environment variables above. i think at a minimum you need to be at
>>> 3.4, but i don't remember the exact version. there have been
>>> multiple defects opened around this area; the last one i remember was:
>> we run mlnx ofed 4.1. the fw is not the latest, but we have edr cards
>> from dell, and the fw is a bit behind. i'm trying to convince dell to
>> make a new one. mellanox used to let you build your own, but they
>> don't anymore.
>>
>>> 00154843 : ESS ConnectX-3 performance issue - spinning on
>>> pthread_spin_lock
>>>
>>> you may ask your mellanox representative if they can get you access
>>> to this defect. while it was found on ESS, meaning on PPC64 and with
>>> ConnectX-3 cards, it's a general issue that affects all cards, on
>>> intel as well as Power.
>> ok, thanks for this. maybe such a reference is enough for dell to
>> update their firmware.
>>
>> stijn
>>
>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt <[email protected]>
>>> wrote:
>>>
>>>> hi all,
>>>>
>>>> is there any documentation wrt data integrity in spectrum scale?
>>>> assuming a crappy network, does gpfs somehow guarantee that data
>>>> written by a client ends up safe in the nsd gpfs daemon, and
>>>> similarly from the nsd gpfs daemon to disk?
>>>>
>>>> and wrt a crappy network, what about rdma on a crappy network? is
>>>> it the same?
>>>>
>>>> (we are hunting down a crappy infiniband issue; ibm support says
>>>> it's a network issue, and we see no errors anywhere...)
>>>>
>>>> thanks a lot,
>>>>
>>>> stijn
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
