I can tell from past experience that offloads like the one you are proposing are 
rarely worth it.  The setup and teardown of the mappings needed for the data 
transport are not necessarily cheap.  You can avoid that by having a 
preallocated region, but then you need to copy the data.  Fortunately, in this 
case you only need to copy once, since the result is very small compared to the 
data.
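To make that concrete, here is a back-of-envelope latency sketch in Python.  Every 
constant in it is an invented, illustrative assumption, not a measurement; the 
point is only that the mapping and copy terms dominate the tiny compute term for 
something as cheap as a checksum.

```python
# Back-of-envelope latency comparison: inline CPU checksum vs. GPU offload.
# All constants are invented, illustrative assumptions, NOT measurements;
# substitute real numbers for your hardware.
BLOCK_BYTES = 128 * 1024        # one 128 KiB ZFS record
CPU_CKSUM_GBPS = 20             # assumed inline SIMD checksum throughput
PCIE_COPY_GBPS = 12             # assumed host-to-device copy throughput
MAP_SETUP_US = 5.0              # assumed per-I/O mapping setup/teardown cost
GPU_CKSUM_GBPS = 200            # assumed device-side checksum throughput

def us(nbytes, gbps):
    """Microseconds to move or process nbytes at the given GB/s."""
    return nbytes / (gbps * 1e9) * 1e6

inline_us = us(BLOCK_BYTES, CPU_CKSUM_GBPS)
offload_us = (MAP_SETUP_US
              + us(BLOCK_BYTES, PCIE_COPY_GBPS)   # copy data to the device
              + us(BLOCK_BYTES, GPU_CKSUM_GBPS))  # device computes checksum
# Copying the ~32-byte result back is negligible, so it is omitted.
print(f"inline: {inline_us:.1f} us, offload: {offload_us:.1f} us")
```

With these made-up numbers the offload loses before you even count queueing or 
the extra code paths; the fixed costs swamp the work being offloaded.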

Then there is the complexity (additional branches, edge cases, etc.) that has 
to be coded.  That becomes performance-sapping as well.

Add to this the fact that CPUs keep getting faster, and advancements like 
extensions to the SIMD instruction sets mean that the gap between the offload 
and just doing the natural thing inline keeps shrinking.

At the end of the day, it’s often the case that your “offload” is actually a 
performance killer.

The exceptions to this are when the work is truly expensive.  For example, 
running (in the old days) RSA on an offload engine made a lot of sense.  (I’m 
not sure it does for elliptic-curve crypto, though.)  Running 3DES (again, if 
you wanted to do that, which you should not) used to make sense.  AES used to, 
but with AES-NI, not anymore.  I suspect that for SHA-2 it’s a toss-up.  Fletcher 
probably does not make sense.  If you want to compress, LZJB does not make 
sense, but GZIP (especially at higher levels) would, if you had such a device.
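For context on why Fletcher is such a poor offload candidate: fletcher_4 amounts 
to four dependent 64-bit additions per 32-bit word of the block.  A rough Python 
sketch of the idea (the real implementation is C with SIMD variants; this one 
assumes the input length is a multiple of four bytes):

```python
import struct

def fletcher4(data: bytes) -> tuple[int, int, int, int]:
    """One sequential pass computing a ZFS-style fletcher_4: four running
    64-bit sums over the little-endian 32-bit words of the block."""
    a = b = c = d = 0
    mask = (1 << 64) - 1                 # sums wrap at 2^64
    for (w,) in struct.iter_unpack("<I", data):
        a = (a + w) & mask
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return a, b, c, d
```

A few adds per word, one pass, no table lookups — by the time you have mapped 
and copied the block to a device, the CPU is already done.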

Algorithms keep getting better (newer ones that are more optimized for actual 
CPUs, etc.) and CPUs keep improving — the GPU is probably best reserved for the 
truly expensive operations it was designed for — complex transforms for 3D 
rendering, expensive hashing (although I wish that weren’t a thing), 
long-running scientific analysis, machine learning, etc.

As an I/O accelerator, not so much.
On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer <thijs.cra...@gmail.com>, wrote:
> I've been searching the GitHub Repository and the Mailing list, but couldn't 
> find any discussion about this.
> I know it's probably silly, but I would like to understand the workings.
>
> Let's say one could offload the checksumming process to a dedicated GPU. This 
> might save some amount of CPU, *but* might increase latency considerably.
>
> To my understanding, ZFS uses the Fletcher4 checksum algorithm by default, and 
> this requires a pass over the data in memory as it calculates the checksum. If 
> we skip this step and instead send the data to the GPU, that would also 
> require a pass over the data (no gains there).
>
> The actual calculation does not seem that hard for a CPU; there are specific 
> SIMD instructions for calculating specific checksums, and after a quick pass 
> over the code, it seems they are already used (if available).
>
> I think the only time a GPU could calculate checksums 'faster' is with a form 
> of readahead.
> If you would pre-read a lot of data, dump it to the GPU's internal memory, and 
> have the GPU calculate the checksums of all the blocks in parallel, it might 
> be able to do it faster than a CPU.
>
> Has anyone considered the idea?
>
> - Thijs

------------------------------------------
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M522b09520eb8e026499c20e8
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription
