That’s about what I would have expected.

Having an offload for high levels of compression (e.g. GZIP 9 or something) 
would be cool, but I don’t think it exists yet.  And it would be hard to write 
that in a way that doesn’t penalize the folks who *don’t* have the 
offload hardware.

• Garrett

On Oct 17, 2022, 8:44 AM -0700, Sanjay G Nadkarni via openzfs-developer 
<developer@lists.open-zfs.org>, wrote:
>
>
> We have been doing regular performance runs using various workloads over 
> NFS(v3,v4.1), SMB3, iSCSI and FC16 & 32 for the past few years. Compression 
> is enabled for all datasets and zvols in our runs. What we have observed is 
> that, under load, compression consumes the most CPU cycles; after that it is 
> a toss-up between dnode locking (a well-known issue) and other things that 
> might come into play depending on the protocol.
>
> At least in our use cases, checksumming of blocks does not appear to be an issue.
>
> -Sanjay
>
>
>
>
> On 10/14/22 10:15 AM, Garrett D'Amore wrote:
> > I can tell from past experience that offloads like what you are proposing 
> > are rarely worth it.  The setup and teardown of the mappings to allow the 
> > data transport are not necessarily cheap.  You can avoid that by having a 
> > preallocated region, but then you need to copy the data.  Fortunately for 
> > this case you only need to copy once, since the result will be very small 
> > compared to the data.
> >
> > Then there is the complexity (additional branches, edge cases, etc.) that 
> > has to be coded.  This becomes performance-sapping as well.
> >
> > Add to this the fact that CPUs are always getting faster, and advancements 
> > like extensions to the SIMD instructions mean that the disparity between 
> > the offload and just doing the natural thing inline gets ever smaller.
> >
> > At the end of the day, it’s often the case that your “offload” is actually 
> > a performance killer.
> >
> > The exceptions to this are when the work is truly expensive.  For example, 
> > running (in the old days) RSA on an offload engine makes a lot of sense.  
> > (I’m not sure it does for elliptic curve crypto though.)  Running 3DES 
> > (again if you wanted to do that, which you should not) used to make sense.  
> > AES used to, but with AES-NI not anymore.  I suspect that for SHA-2 it’s a 
> > toss-up.  Fletcher probably does not make sense.  If you want to compress, 
> > LZJB does not make sense, but GZIP (especially at higher levels) would, if 
> > you had such a device.
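For a rough sense of why high gzip levels are the plausible offload target, here is a small sketch using Python's zlib (zlib compression levels as a stand-in for ZFS's gzip-N; the payload is made up, and timings will vary by machine):

```python
import time
import zlib

# Hypothetical sample payload: compressible text repeated to roughly 4 MiB.
data = b"OpenZFS stores blocks that are often quite compressible. " * 70000

# Higher levels spend substantially more CPU time for a modest gain in
# ratio -- the gap that a dedicated compression engine could absorb.
for level in (1, 6, 9):
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    print(f"level {level}: {elapsed * 1000:.1f} ms, "
          f"{len(data) / len(compressed):.1f}x ratio")

# Sanity check: the output round-trips regardless of level.
assert zlib.decompress(zlib.compress(data, 9)) == data
```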
> >
> > Algorithms are always getting better (newer ones that are more optimized 
> > for actual CPUs etc.) and CPUs are always improving — the GPU is probably 
> > best reserved for truly expensive operations for which it was designed — 
> > complex transforms for 3D rendering, expensive hashing (although I wish 
> > that wasn’t a thing), long running scientific analysis, machine learning, 
> > etc.
> >
> > As an I/O accelerator, not so much.
> > On Oct 14, 2022, 7:52 AM -0700, Thijs Cramer <thijs.cra...@gmail.com>, 
> > wrote:
> > > I've been searching the GitHub Repository and the Mailing list, but 
> > > couldn't find any discussion about this.
> > > I know it's probably silly, but I would like to understand the workings.
> > >
> > > Let's say one could offload the Checksumming process to a dedicated GPU. 
> > > This might save some CPU, *but* might increase latency 
> > > considerably.
> > >
> > > To my understanding ZFS uses the Fletcher4 Checksum Algorithm by default, 
> > > and this requires a pass of the data in-memory as it calculates the 
> > > checksum. If we skip this step, and instead send the data to the GPU, 
> > > that would also require a pass of the data (no gains there).
> > >
> > > The actual calculation does not seem that hard for a CPU; there are 
> > > SIMD instructions for calculating certain checksums, and after a 
> > > quick pass over the code, it seems they are already used (if available).
> > >
> > > I think the only time a GPU could calculate checksums 'faster' is 
> > > with a form of readahead.
> > > If you pre-read a lot of data, dump it to the GPU's internal 
> > > memory, and have the GPU calculate checksums of the blocks in 
> > > parallel, it might be able to do it faster than a CPU.
> > >
> > > Has anyone considered the idea?
> > >
> > > - Thijs

------------------------------------------
openzfs: openzfs-developer
Permalink: 
https://openzfs.topicbox.com/groups/developer/T2be6db01da63a639-M58be0a7684e7d44a39f75747
Delivery options: https://openzfs.topicbox.com/groups/developer/subscription
