I can tell from past experience that offloads like the one you are proposing are rarely worth it. The setup and teardown of the mappings needed for the data transport are not necessarily cheap. You can avoid that by having a preallocated region, but then you need to copy the data. Fortunately, in this case you only need to copy once, since the result will be very small compared to the data.
Then there is the complexity (additional branches, edge cases, etc.) that has to be coded; that becomes performance-sapping as well. Add to this the fact that CPUs are always getting faster, and advancements like extensions to the SIMD instruction sets mean that the gap between the offload and just doing the natural thing inline gets ever smaller. At the end of the day, it's often the case that your "offload" is actually a performance killer.

The exceptions are when the work is truly expensive. For example, running RSA on an offload engine made a lot of sense in the old days. (I'm not sure it does for elliptic curve crypto, though.) Running 3DES used to make sense (again, if you wanted to do that, which you should not). AES used to, but with AES-NI, not anymore. I suspect that for SHA-2 it's a toss-up. Fletcher probably does not make sense. If you want to compress, LZJB does not make sense, but GZIP (especially at higher levels) would, if you had such a device.

Algorithms are always getting better (newer ones are more optimized for actual CPUs, etc.) and CPUs are always improving. The GPU is probably best reserved for the truly expensive operations it was designed for: complex transforms for 3D rendering, expensive hashing (although I wish that weren't a thing), long-running scientific analysis, machine learning, etc. As an I/O accelerator, not so much.

On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer <thijs.cra...@gmail.com>, wrote:
> I've been searching the GitHub repository and the mailing list, but couldn't
> find any discussion about this.
> I know it's probably silly, but I would like to understand the workings.
>
> Let's say one could offload the checksumming process to a dedicated GPU. This
> might save some amount of CPU, *but* might increase latency considerably.
>
> To my understanding, ZFS uses the Fletcher4 checksum algorithm by default, and
> this requires a pass over the data in memory as it calculates the checksum.
> If we skip this step and instead send the data to the GPU, that would also
> require a pass over the data (no gains there).
>
> The actual calculation is not that hard for a CPU, it seems; there are
> specific SIMD instructions for calculating specific checksums, and after a
> quick pass over the code, it seems they are already used (if available).
>
> I think the only time a GPU could calculate checksums 'faster' is with
> a form of readahead.
> If you would pre-read a lot of data, dump it to the GPU's internal
> memory, and have the GPU calculate checksums of the entire block in parallel,
> it might be able to do it faster than a CPU.
>
> Has anyone considered the idea?
>
> - Thijs

------------------------------------------
openzfs: openzfs-developer
Permalink: https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M522b09520eb8e026499c20e8