That's about what I would have expected. Having an offload for high levels of compression (e.g. GZIP level 9) would be cool, but I don't think it exists yet. And it would be hard to write that in a way that doesn't penalize the folks who *don't* have the offload hardware.
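The trade-off behind that remark, that only the higher compression levels are expensive enough to be worth offloading, can be illustrated with a minimal sketch using Python's zlib as a stand-in for ZFS's gzip levels (the buffer and sizes here are illustrative, not ZFS measurements):

```python
import zlib

# Illustrative only: compare gzip-style compression levels on a
# repetitive buffer. Level 9 searches much harder than level 1 for a
# usually modest gain in output size -- that extra CPU time is what a
# hypothetical compression offload engine would be absorbing.
data = b"example payload " * 8192

fast = zlib.compress(data, level=1)
best = zlib.compress(data, level=9)

# Level 9 never produces a larger result here, and both round-trip
# losslessly back to the original data.
assert len(best) <= len(fast)
assert zlib.decompress(best) == data
```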
• Garrett

On Oct 17, 2022, 8:44 AM -0700, Sanjay G Nadkarni via openzfs-developer <developer@lists.open-zfs.org>, wrote:
>
> We have been doing regular performance runs using various workloads over
> NFS (v3, v4.1), SMB3, iSCSI, and FC 16 & 32 for the past few years.
> Compression is enabled for all datasets and zvols in our runs. What we
> have observed is that, under load, compression consumes the most CPU
> cycles; after that it is a toss-up between dnode locking (a well-known
> issue) and other things that might come into play depending on the
> protocol.
>
> At least in our use cases, checksumming of blocks does not appear to be
> an issue.
>
> -Sanjay
>
> On 10/14/22 10:15 AM, Garrett D'Amore wrote:
> > I can tell from past experience that offloads like what you are
> > proposing are rarely worth it. The setup and teardown of the mappings
> > to allow the data transport are not necessarily cheap. You can avoid
> > that by having a preallocated region, but then you need to copy the
> > data. Fortunately, in this case you only need to copy once, since the
> > result will be very small compared to the data.
> >
> > Then there is the complexity (additional branches, edge cases, etc.)
> > that has to be coded. These become performance-sapping as well.
> >
> > Add to this the fact that CPUs are always getting faster, and
> > advancements like extensions to the SIMD instructions mean that the
> > disparity between the offload and just doing the natural thing inline
> > gets ever smaller.
> >
> > At the end of the day, it's often the case that your "offload" is
> > actually a performance killer.
> >
> > The exceptions to this are when the work is truly expensive. For
> > example, running (in the old days) RSA on an offload engine makes a
> > lot of sense. (I'm not sure it does for elliptic-curve crypto,
> > though.) Running 3DES (again, if you wanted to do that, which you
> > should not) used to make sense. AES used to, but with AES-NI, not
> > anymore.
> > I suspect that for SHA-2 it's a toss-up. Fletcher probably does not
> > make sense. If you want to compress, LZJB does not make sense, but
> > GZIP (especially at higher levels) would, if you had such a device.
> >
> > Algorithms are always getting better (newer ones that are more
> > optimized for actual CPUs, etc.) and CPUs are always improving; the
> > GPU is probably best reserved for the truly expensive operations it
> > was designed for: complex transforms for 3D rendering, expensive
> > hashing (although I wish that weren't a thing), long-running
> > scientific analysis, machine learning, etc.
> >
> > As an I/O accelerator, not so much.
> >
> > On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer <thijs.cra...@gmail.com>, wrote:
> > > I've been searching the GitHub repository and the mailing list, but
> > > couldn't find any discussion about this. I know it's probably silly,
> > > but I would like to understand the workings.
> > >
> > > Let's say one could offload the checksumming process to a dedicated
> > > GPU. This might save some amount of CPU, *but* might increase
> > > latency considerably.
> > >
> > > To my understanding, ZFS uses the Fletcher4 checksum algorithm by
> > > default, and this requires a pass over the data in memory as it
> > > calculates the checksum. If we skip this step and instead send the
> > > data to the GPU, that would also require a pass over the data (no
> > > gains there).
> > >
> > > The actual calculation is not that hard for a CPU, it seems; there
> > > are specific SIMD instructions for calculating certain checksums,
> > > and after a quick pass over the code, it seems they are already used
> > > (if available).
> > >
> > > I think the only time a GPU could calculate checksums 'faster' is
> > > with some form of readahead. If you pre-read a lot of data, dumped
> > > it into the GPU's internal memory, and made the GPU calculate the
> > > checksums of the entire batch in parallel, it might be able to do it
> > > faster than a CPU.
> > > Has anyone considered the idea?
> > >
> > > - Thijs

openzfs: openzfs-developer
Permalink: https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M58be0a7684e7d44a39f75747
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription
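Garrett's observation that "you only need to copy once, since the result will be very small compared to the data" can be made concrete with a tiny Python sketch (the 128 KiB block size is illustrative, chosen to match a common ZFS record size):

```python
import hashlib

# Illustrative only: the output of a checksum offload is tiny compared
# to its input, so copying the *result* back from a device would be
# cheap. It is mapping or copying the input to the device that costs.
block = bytes(128 * 1024)               # a 128 KiB record-sized buffer
digest = hashlib.sha256(block).digest() # 32 bytes out

assert len(digest) == 32
assert len(block) // len(digest) == 4096  # input is 4096x larger
```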
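The single pass Thijs describes is visible in the structure of Fletcher-4 itself. Here is a minimal pure-Python sketch of the algorithm in the form ZFS uses it, with four 64-bit accumulators updated once per little-endian 32-bit word (the real implementation is SIMD-optimized C; this sketch is for illustration, not performance):

```python
import struct

MASK64 = (1 << 64) - 1  # emulate C's wrapping uint64_t arithmetic

def fletcher4(data: bytes) -> tuple:
    """One pass over `data`; length must be a multiple of 4 bytes."""
    a = b = c = d = 0
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & MASK64
        b = (b + a) & MASK64
        c = (c + b) & MASK64
        d = (d + c) & MASK64
    return (a, b, c, d)
```

Every input word is touched exactly once, so an offload saves nothing on memory traffic unless the data is already resident on the device.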
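The readahead idea at the end of the thread (batch up many blocks, then checksum them all in parallel) can be sketched on the CPU with a thread pool; `checksum_batch` is a hypothetical helper, and a real GPU version would additionally pay the host-to-device copy cost the thread warns about:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def checksum_batch(blocks):
    # Hypothetical sketch: hash a batch of blocks concurrently, a CPU
    # stand-in for "dump many blocks to the GPU and checksum them all
    # at once". CPython's hashlib releases the GIL while hashing large
    # buffers, so the threads can genuinely overlap.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda b: hashlib.sha256(b).hexdigest(), blocks))
```

`pool.map` preserves input order, so each result lines up with its block, mirroring how per-block checksums must be matched back to their records.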