I believe the Intel QAT support we have will happily offload gzip for you, though I don't know if it makes any promises about what level equivalent of gzip it hands you back...
- Rich

On Mon, Oct 17, 2022 at 12:02 PM Garrett D'Amore <garr...@damore.org> wrote:

> That’s about what I would have expected.
>
> Having an offload for high levels of compression (e.g. GZIP 9 or something) would be cool, but I don’t think it exists yet. And it would be hard to write that in a way that doesn’t punish things for the folks who *don’t* have the offload hardware.
>
> - Garrett
>
> On Oct 17, 2022, 8:44 AM -0700, Sanjay G Nadkarni via openzfs-developer <developer@lists.open-zfs.org> wrote:
>
> > We have been doing regular performance runs using various workloads over NFS (v3, v4.1), SMB3, iSCSI, and FC 16 & 32 for the past few years. Compression is enabled for all datasets and zvols in our runs. What we have observed is that, under load, compression consumes the most CPU cycles; after that it is a toss-up between dnode locking (a well-known issue) and other things that might come into play depending on the protocol.
> >
> > At least in our use cases, checksumming of blocks does not appear to be an issue.
> >
> > -Sanjay
> >
> > On 10/14/22 10:15 AM, Garrett D'Amore wrote:
> >
> > > I can tell from past experience that offloads like what you are proposing are rarely worth it. The setup and teardown of the mappings to allow the data transport are not necessarily cheap. You can avoid that by having a preallocated region, but then you need to copy the data. Fortunately, in this case you only need to copy once, since the result will be very small compared to the data.
> > >
> > > Then there is the complexity (additional branches, edge cases, etc.) that has to be coded. This becomes performance-sapping as well.
> > >
> > > Add to this the fact that CPUs are always getting faster, and advancements like extensions to the SIMD instructions mean that the disparity between the offload and just doing the natural thing inline gets ever smaller.
> > >
> > > At the end of the day, it’s often the case that your “offload” is actually a performance killer.
> > > The exceptions to this are when the work is truly expensive. For example, running (in the old days) RSA on an offload engine makes a lot of sense. (I’m not sure it does for elliptic curve crypto, though.) Running 3DES (again, if you wanted to do that, which you should not) used to make sense. AES used to, but with AES-NI, not anymore. I suspect that for SHA2 it’s a toss-up. Fletcher probably does not make sense. If you want to compress, LZJB does not make sense, but GZIP (especially at higher levels) would, if you had such a device.
> > >
> > > Algorithms are always getting better (newer ones that are more optimized for actual CPUs, etc.) and CPUs are always improving. The GPU is probably best reserved for truly expensive operations for which it was designed: complex transforms for 3D rendering, expensive hashing (although I wish that wasn’t a thing), long-running scientific analysis, machine learning, etc.
> > >
> > > As an I/O accelerator, not so much.
> > >
> > > On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer <thijs.cra...@gmail.com> wrote:
> > >
> > > > I’ve been searching the GitHub repository and the mailing list, but couldn’t find any discussion about this. I know it’s probably silly, but I would like to understand the workings.
> > > >
> > > > Let’s say one could offload the checksumming process to a dedicated GPU. This might save some amount of CPU, *but* might increase latency considerably.
> > > >
> > > > To my understanding, ZFS uses the Fletcher4 checksum algorithm by default, and this requires a pass over the data in memory as it calculates the checksum. If we skip this step and instead send the data to the GPU, that would also require a pass over the data (no gains there).
> > > >
> > > > The actual calculation is not that hard for a CPU, it seems; there are specific SIMD instructions for calculating specific checksums, and after a quick pass over the code, it seems they are already used (if available).
> > > > I think the only time a GPU could calculate checksums ‘faster’ is with a form of read-ahead. If you pre-read a lot of data, dumped it into the GPU’s internal memory, and made the GPU calculate checksums of an entire block in parallel, it might be able to do it faster than a CPU.
> > > >
> > > > Has anyone considered the idea?
> > > >
> > > > - Thijs

------------------------------------------
openzfs: openzfs-developer
Permalink: https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-Mdd15974624ca67d893ee40c0
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription
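[Editor's note] On the point that Fletcher4 requires one pass over the data in memory: below is a rough pure-Python sketch of the Fletcher-4 recurrence ZFS uses (four 64-bit running sums over the data viewed as little-endian 32-bit words). The real implementation is in C with SIMD variants; this sketch only illustrates the single linear pass over the buffer that any offload would still have to pay for.

```python
import struct

def fletcher4(buf: bytes):
    """Sketch of the Fletcher-4 recurrence: four 64-bit running sums
    computed in one pass over the buffer as little-endian 32-bit words."""
    a = b = c = d = 0
    mask = (1 << 64) - 1  # sums are modulo 2^64
    for (w,) in struct.iter_unpack("<I", buf):
        a = (a + w) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return a, b, c, d
```

Note how each word is touched exactly once, but each sum depends on the previous word's sums; that serial dependency is what the SIMD implementations restructure, and what a GPU would have to break up into per-chunk partial sums to parallelize.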
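[Editor's note] On the question of what level equivalent of gzip an offload engine hands back: the level matters because it trades CPU time for on-disk ratio. A minimal sketch using Python's zlib module (same DEFLATE algorithm as gzip; the buffer contents are made up for illustration) to show how output size varies with level:

```python
import zlib

# Hypothetical, highly compressible buffer, purely for illustration.
data = b"the quick brown fox jumps over the lazy dog " * 10000

# DEFLATE levels 1 (fastest), 6 (zlib default), and 9 (best ratio);
# these correspond to gzip -1 / -6 / -9.
for level in (1, 6, 9):
    out = zlib.compress(data, level)
    print(f"level {level}: {len(out)} bytes")
```

Whether a QAT engine's output matches gzip -9's ratio or only something closer to -1 determines how much on-disk space the offload costs you relative to doing the high-level compression on the CPU.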