I believe the Intel QAT support we have will happily offload gzip for you,
though I don't know if it makes any promises about what level equivalent of
gzip it hands you back...

- Rich

On Mon, Oct 17, 2022 at 12:02 PM Garrett D'Amore <garr...@damore.org> wrote:

> That’s about what I would have expected.
>
> Having an offload for high levels of compression (e.g. GZIP 9 or
> something) would be cool, but I don’t think it exists yet.  And it would be
> hard to write that in a way that doesn’t punish things for the folks who
> *don’t* have the offload hardware.
>
>    - Garrett
>
> On Oct 17, 2022, 8:44 AM -0700, Sanjay G Nadkarni via openzfs-developer <
> developer@lists.open-zfs.org>, wrote:
>
>
>
> We have been doing regular performance runs using various workloads over
> NFS (v3, v4.1), SMB3, iSCSI, and FC (16 and 32 Gb) for the past few years.
> Compression is enabled for all datasets and zvols in our runs. What we have
> observed is that, under load, compression consumes the most CPU cycles;
> after that it is a toss-up between dnode locking (a well-known issue) and
> other things that come into play depending on the protocol.
>
> At least in our use cases, checksumming of blocks does not appear to be an
> issue.
>
> -Sanjay
>
>
>
>
> On 10/14/22 10:15 AM, Garrett D'Amore wrote:
>
> I can tell you from past experience that offloads like what you are
> proposing are rarely worth it.  The setup and teardown of the mappings to
> allow the data transport are not necessarily cheap.  You can avoid that by
> using a preallocated region, but then you need to copy the data.
> Fortunately, in this case you only need to copy once, since the result will
> be very small compared to the data.
>
> Then there is the complexity (additional branches, edge cases, etc.) that
> has to be coded.  That saps performance as well.
>
> Add to this the fact that CPUs are always getting faster, and advancements
> like extensions to the SIMD instructions mean that the disparity between
> the offload and just doing the natural thing inline gets ever smaller.
>
> At the end of the day, it’s often the case that your “offload” is actually
> a performance killer.
>
> The exceptions to this are when the work is truly expensive.  For example,
> running (in the old days) RSA on an offload engine makes a lot of
> sense.  (I’m not sure it does for elliptic curve crypto though.)  Running
> 3DES (again if you wanted to do that, which you should not) used to make
> sense.  AES used to, but with AES-NI, not anymore.  I suspect that for SHA2
> it's a toss-up.  Fletcher probably does not make sense.  If you want to
> compress, LZJB does not make sense, but GZIP (especially at higher levels)
> would, if you had such a device.
>
> Algorithms are always getting better (newer ones that are more optimized
> for actual CPUs etc.) and CPUs are always improving — the GPU is probably
> best reserved for truly expensive operations for which it was designed —
> complex transforms for 3D rendering, expensive hashing (although I wish
> that wasn’t a thing), long running scientific analysis, machine learning,
> etc.
>
> As an I/O accelerator, not so much.
> On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer <thijs.cra...@gmail.com>,
> wrote:
>
> I've been searching the GitHub Repository and the Mailing list, but
> couldn't find any discussion about this.
> I know it's probably silly, but I would like to understand the workings.
>
> Let's say one could offload the checksumming process to a dedicated GPU.
> This might save some amount of CPU, *but* might increase latency
> considerably.
>
> To my understanding ZFS uses the Fletcher4 Checksum Algorithm by default,
> and this requires a pass of the data in-memory as it calculates the
> checksum. If we skip this step, and instead send the data to the GPU, that
> would also require a pass of the data (no gains there).
>
> The actual calculation is not that hard for a CPU, it seems; there are
> specific SIMD instructions for calculating certain checksums, and after a
> quick pass over the code, it seems they are already used (if available).
>
> I think the only time a GPU could calculate checksums 'faster' is with a
> form of readahead.
> If you pre-read a lot of data, dump it to the GPU's internal memory, and
> have the GPU calculate checksums of the entire block in parallel, it might
> be able to do it faster than a CPU.
>
> Has anyone considered the idea?
>
> - Thijs
>
>

------------------------------------------
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-Mdd15974624ca67d893ee40c0
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription
