On Thu, Jul 02, 2026 at 08:45:34AM +0000, Leonid Ravich wrote:
> On Wed, Jul 01, 2026 at 12:19:19AM -0700, Eric Biggers wrote:
> > No, this didn't address my feedback.  It moved things around but still
> > adds additional overhead for everyone to support an out-of-tree driver,
> > which also hasn't been shown to be any better than just using the CPU.
> 
> Eric, thanks for the fast reply.
> 
> Overhead: for a non-user the only cost is the data_unit_size field plus
> one zeroing store in set_tfm()/ON_STACK; the en/decrypt paths are
> untouched.

Sure, which is still a cost for everyone.

> A dun() user pays one indirect dispatch into the template per
> request plus a scatterwalk step and IV copy per unit -- the same per-DU
> bookkeeping the consumer already open-codes today.

It's not the same at all.  There's now an extra indirect call, more
per-request memory used, additional overhead to create a scatterlist and
then break it up into multiple ones using the fully generalized
scatterlist walker instead of just creating the correct ones in the
first place from the bio_vec, etc.  We need to be simplifying the crypto
APIs, not making them even more complex and adding more overhead.

> On the driver: I agree pushing code optimized for an out-of-tree driver
> is wrong

So don't do it.

> but I don't think that's the case here -- this helps any async
> crypto engine, and there are in-tree async xts(aes) ones dm-crypt is
> eligible to use today: HiSilicon SEC2, TI DTHEv2, Atmel (I don't have any
> to test on).

It helps nothing, as there is no patch and no benchmarks.

> To bound the win, I used cryptd as a pure async carrier and
> moved the per-DU split inside it, then ran dm-crypt + fio: batching cut
> CPU ~30% on 128k I/O (large batch) and had zero impact on 4k -- so the
> saving is dispatch, not crypto.
>
> A real engine that submits a whole
> multi-DU request in one descriptor avoids that per-DU dispatch entirely,
> so it saves at least that.

That is not an equivalent benchmark at all.

> So the question for me is what the bar is: does landing the API and dun()
> template now (with the in-tree consolidation it already buys dm-crypt and
> blk-crypto-fallback), with a throughput demonstration deferred to a real
> async provider, work for you ?

I definitely don't want this for blk-crypto-fallback.
blk-crypto-fallback already only supports CPU-based crypto acceleration.
(As it should, because nothing else is actually worthwhile these days,
other than hardware inline encryption which is separately supported.)
And I'm also planning to switch it from crypto_skcipher to lib/crypto/
to eliminate the remaining API overhead, which will actually accomplish
that, unlike the dun() thing which just makes it worse.

The bar for adding things to the upstream kernel is that it has to
actually be used *and* beneficial in the upstream kernel.

- Eric

Reply via email to