The point of the original question was to discover why there is a warning about 
performance for nsdCksumTraditional=yes, but that warning doesn’t seem to 
apply to an ESS environment.

Your reply was that checksums in an ESS environment are calculated in parallel 
on the NSD server based on the physical storage layout used underneath the NSD, 
and are thus faster. My point was that if there is never a checksum calculated 
by the NSD client, then how does the NSD server know that it got uncorrupted 
data?

The link you referenced below (thank you!) indicates that, in fact, the NSD 
client DOES calculate a checksum and forward it with the data to the NSD 
server. The server validates the data (necessitating a re-calculation of the 
checksum), and then GNR stores the data, A CHECKSUM[1], and some block metadata 
to media. 
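As an aside, that flow can be sketched in a few lines. This is purely
illustrative, not the real GPFS wire protocol; CRC32 stands in for whatever
checksum GPFS actually uses, and the function names are mine:

```python
# Illustrative sketch only -- not GPFS/GNR code. It models the flow the
# document describes: the NSD client checksums the buffer and sends both,
# and the server re-computes the checksum to validate what arrived.
import zlib

def client_send(data: bytes) -> tuple[bytes, int]:
    # Client computes a checksum over the payload before transmission.
    return data, zlib.crc32(data)

def server_receive(data: bytes, cksum: int) -> bytes:
    # Server must re-compute the checksum to validate the payload;
    # that recomputation is the per-I/O cost on the receive side.
    if zlib.crc32(data) != cksum:
        raise IOError("checksum mismatch: data corrupted in flight")
    return data

payload, c = client_send(b"block of file data")
assert server_receive(payload, c) == b"block of file data"
```

Either way, the validate step forces a full pass over the data on the server,
which is the work being compared in both cases below.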

So this leaves us with a checksum calculated by the client and then validated 
(re-calculated) by the server — IN BOTH CASES. For the GNR case, another 
checksum is calculated and stored with the data for another purpose, but that 
means that the nsdCksumTraditional=yes case is exactly like the first phase of 
the GNR case. So why is that case slower when it does less work? Slow enough to 
merit a warning, no less!
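To make the performance question concrete, here is a hypothetical sketch of 
the two computation strategies being contrasted. The strip size, use of CRC32, 
and threading model are all assumptions for illustration, not GPFS internals:

```python
# Illustrative sketch only -- not GPFS internals. It contrasts a serial
# checksum over the whole NSD buffer (the nsdCksumTraditional model) with
# per-strip checksums computed concurrently (the GNR model, possible because
# the ESS IO server knows the vdisk strip layout).
import zlib
from concurrent.futures import ThreadPoolExecutor

STRIP = 256 * 1024  # hypothetical strip size, for illustration only

def serial_cksum(buf: bytes) -> int:
    # One thread walks the entire buffer.
    return zlib.crc32(buf)

def parallel_strip_cksums(buf: bytes) -> list[int]:
    # One checksum per strip, computed across a thread pool; each strip's
    # checksum can also be stored alongside that strip on media.
    strips = [buf[i:i + STRIP] for i in range(0, len(buf), STRIP)]
    with ThreadPoolExecutor() as pool:
        return list(pool.map(zlib.crc32, strips))
```

(In CPython, zlib.crc32 releases the GIL on large buffers, so the threads do 
genuinely overlap — analogous to the per-pdisk parallelism described below.)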

I’m really not trying to be a pest, but I have a logic problem with either the 
question or the answer — they aren’t consistent (or I can’t rationalize them to 
be so).

-- 
Stephen

[1] The document is vague (I believe intentionally, because it could have 
easily been made clear) as to whether this is the same checksum or a different 
one. Presumably the new server-side checksum is calculated in parallel and 
protects the chunklets, or whatever they're called. This is all consistent with 
what you said!
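For what it's worth, that second (store-to-media) phase could be sketched like 
this. Again, a purely hypothetical model: CRC32 and the metadata fields stand 
in for whatever GNR actually records on disk:

```python
# Illustrative sketch only -- not GNR internals. It models the phase where,
# after validating the payload, the server stores the data together with a
# checksum and block metadata, so reads from media can be verified
# independently of the network transfer.
import zlib

# "media" is a hypothetical stand-in for on-disk storage.
media: dict[int, tuple[bytes, int, dict]] = {}

def store_block(block_id: int, data: bytes) -> None:
    # The stored checksum may well be recomputed here (e.g. per physical
    # strip) rather than reusing the one received over the wire.
    media[block_id] = (data, zlib.crc32(data), {"len": len(data)})

def read_block(block_id: int) -> bytes:
    data, cksum, meta = media[block_id]
    # On read, the stored checksum catches corruption that happened at rest.
    if zlib.crc32(data) != cksum or meta["len"] != len(data):
        raise IOError("on-media corruption detected")
    return data
```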



> On Oct 29, 2018, at 5:29 PM, Kumaran Rajaram <[email protected]> wrote:
> 
> In a non-GNR setup, nsdCksumTraditional=yes enables data-integrity checking 
> between a traditional NSD client node and its NSD server, at the network 
> level only.
> 
> The ESS storage supports end-to-end checksums, from the NSD client to the 
> ESS IO servers (at the network level) as well as from the ESS IO servers to 
> the disk/storage. This is further detailed in the docs (link below):
> 
> https://www.ibm.com/support/knowledgecenter/en/SSYSP8_5.3.1/com.ibm.spectrum.scale.raid.v5r01.adm.doc/bl1adv_introe2echecksum.htm
> 
> Best,
> -Kums
> 
> 
> 
> 
> 
> From:        Stephen Ulmer <[email protected]>
> To:        gpfsug main discussion list <[email protected]>
> Date:        10/29/2018 04:52 PM
> Subject:        Re: [gpfsug-discuss] NSD network checksums (nsdCksumTraditional)
> Sent by:        [email protected]
> 
> 
> 
> So the ESS checksums that are highly touted as "protecting all the way to the 
> disk surface" completely ignore the transfer between the client and the NSD 
> server? It sounds like you are saying that all of the checksumming done for 
> GNR is internal to GNR and only protects against bit-flips on the disk (and 
> in staging buffers, etc.).
> 
> I’m asking because your explanation completely ignores calculating anything 
> on the NSD client, and implies that the client could not participate, given 
> that it does not know about the structure of the vdisks under the NSD. But 
> checksum calculation has to be a performance factor for both types if the 
> transfer is protected starting at the client — which it is in the case of 
> nsdCksumTraditional, which is what we are comparing to ESS checksumming.
> 
> If ESS checksumming doesn’t protect on the wire, I’d say that marketing has 
> run amok, because that has *definitely* been implied in meetings at which 
> I’ve been present. In fact, when asked if Spectrum Scale provides 
> checksumming for data in flight, IBM sales has used it as an ESS up-sell 
> opportunity.
> 
> -- 
> Stephen
> 
> 
> 
> On Oct 29, 2018, at 3:56 PM, Kumaran Rajaram <[email protected]> wrote:
> 
> Hi,
> 
> >>How can it be that the I/O performance degradation warning only seems to 
> >>accompany the nsdCksumTraditional setting and not GNR?
> >>Why is there such a penalty for "traditional" environments?
> 
> On GNR IO/NSD servers (ESS IO nodes), the checksums are computed in parallel 
> for an NSD (storage volume/vdisk) across the threads handling each 
> pdisk/drive that constitutes the vdisk/volume. This is possible since the GNR 
> software on the ESS IO servers is tightly integrated with the underlying 
> storage and is aware of the vdisk DRAID configuration (strip size, pdisks 
> constituting the vdisk, etc.), allowing it to perform parallel checksum 
> operations.
> 
> In the non-GNR + external storage model, the GPFS software on the NSD 
> server(s) does not manage the underlying storage volume (this is done by the 
> storage RAID controllers), and the checksum is computed serially. This 
> contributes to increased CPU usage and I/O performance degradation (depending 
> on I/O access patterns, I/O load, etc.).
> 
> My two cents.
> 
> Regards,
> -Kums
> 
> 
> 
> 
> 
> From:        Aaron Knister <[email protected]>
> To:        gpfsug main discussion list <[email protected]>
> Date:        10/29/2018 12:34 PM
> Subject:        [gpfsug-discuss] NSD network checksums (nsdCksumTraditional)
> Sent by:        [email protected]
> 
> 
> 
> Flipping through the slides from the recent SSUG meeting I noticed that 
> in 5.0.2 one of the features mentioned was the nsdCksumTraditional flag. 
> Reading up on it, it seems as though it comes with a warning about 
> significant I/O performance degradation and increased CPU usage. I 
> also recall that data integrity checking is performed by default with 
> GNR. How can it be that the I/O performance degradation warning only 
> seems to accompany the nsdCksumTraditional setting and not GNR? As 
> someone who knows exactly 0 of the implementation details, I'm just 
> naively assuming that the checksums are being generated (in the same 
> way?) in both cases and transferred to the NSD server. Why is there such 
> a penalty for "traditional" environments?
> 
> -Aaron
> 
> -- 
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2)
> Goddard Space Flight Center
> (301) 286-2776
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
