Bug#1032173: identity recoding is too identical

Zefram Thu, 02 Mar 2023 15:33:16 -0800

Reuben Thomas wrote:
>I agree that this behaviour is undesirable. Unfortunately it's deep-seated.
>I have done some work on it, but I don't yet have something I can release.


Oh cool, so I'm not the first person to notice, and there's already been
some progress.

Quick thought about a way this could be tackled: internally you could
explicitly represent the input checking step as distinct from a "mere
copy" operation, and interpret "UTF-8..UTF-8" as a checking step.  Some
checking steps can then be optimised out of the operation sequence.
Checking that the input conforms to a particular charset immediately
after conversion to that same charset can be optimised out, checking
conformance to any 8-bit single-byte charset is null and can be optimised
out, and there are some cases where checks for different charsets are
equivalent.
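
A minimal sketch of the idea above, in Python rather than recode's own
C.  Every name here (Check, Convert, parse_request, optimise, and the
ALL_BYTES_VALID table) is hypothetical and not taken from recode; the
table of charsets in which every byte value is valid is only an
illustrative assumption.

```python
from dataclasses import dataclass

# Assumption for illustration: charsets in which every byte value is valid,
# so a conformance check can never fail.
ALL_BYTES_VALID = {"ISO-8859-1", "CP437"}


@dataclass(frozen=True)
class Check:
    """Verify that the data conforms to `charset`, without changing it."""
    charset: str


@dataclass(frozen=True)
class Convert:
    """Convert from `src` to `dst`; the output is valid `dst` by construction."""
    src: str
    dst: str


def parse_request(src, dst):
    """Treat an identity request such as UTF-8..UTF-8 as a check, not a copy."""
    if src == dst:
        return [Check(src)]
    return [Convert(src, dst)]


def optimise(steps):
    """Drop checks that cannot fail given the step before them."""
    out = []
    for step in steps:
        if isinstance(step, Check):
            # A check against a charset with no invalid bytes is a no-op.
            if step.charset in ALL_BYTES_VALID:
                continue
            # A check right after a conversion into the same charset is redundant.
            prev = out[-1] if out else None
            if isinstance(prev, Convert) and prev.dst == step.charset:
                continue
        out.append(step)
    return out
```

With this representation a bare "UTF-8..UTF-8" request keeps its check,
while the same check placed directly after a conversion into UTF-8 is
removed.  Equivalence between checks for different charsets is not
modelled here.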

Further refinement of the above: in some cases there might be value
in splitting a conversion step into a checking step followed by a
non-checking conversion.  The value here is that the checking step
might then be optimised out, depending on the prior step of the
pipeline.  At a later stage of optimisation, maybe any checking step
that remains recombines with its non-checking conversion into an
ordinary checking conversion of the kind you already have.
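
The split, optimise, recombine sequence could look roughly like the
following Python sketch.  Again all names (Check, Convert, RawConvert,
split, drop_redundant_checks, recombine) are hypothetical, and the
three-pass structure is just one way to arrange it, not a description of
how recode works.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """Verify that the data conforms to `charset`."""
    charset: str


@dataclass(frozen=True)
class Convert:
    """Ordinary conversion that validates its input as it goes."""
    src: str
    dst: str


@dataclass(frozen=True)
class RawConvert:
    """Conversion that trusts its input to be valid `src`."""
    src: str
    dst: str


def split(steps):
    """Rewrite each checking conversion as Check followed by RawConvert."""
    out = []
    for s in steps:
        if isinstance(s, Convert):
            out += [Check(s.src), RawConvert(s.src, s.dst)]
        else:
            out.append(s)
    return out


def drop_redundant_checks(steps):
    """Remove a Check that directly follows a conversion into that charset."""
    out = []
    for s in steps:
        prev = out[-1] if out else None
        if (isinstance(s, Check)
                and isinstance(prev, (Convert, RawConvert))
                and prev.dst == s.charset):
            continue
        out.append(s)
    return out


def recombine(steps):
    """Fuse a surviving Check with the RawConvert after it back into Convert."""
    out = []
    for s in steps:
        prev = out[-1] if out else None
        if isinstance(s, RawConvert) and prev == Check(s.src):
            out[-1] = Convert(s.src, s.dst)
        else:
            out.append(s)
    return out


def optimise(steps):
    return recombine(drop_redundant_checks(split(steps)))
```

For a two-step pipeline ISO-8859-1..UTF-8..UTF-16, the first step stays
an ordinary checking conversion, while the second becomes a non-checking
one because its input was just produced as UTF-8.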

-zefram
