Re: Encode UTF-8 optimizations

pali Fri, 19 Aug 2016 20:26:35 -0700

On Thursday 18 August 2016 23:06:27 Karl Williamson wrote:
> On 08/12/2016 09:31 AM, [email protected] wrote:
> >On Thursday 11 August 2016 17:41:23 Karl Williamson wrote:
> >>On 07/09/2016 05:12 PM, [email protected] wrote:
> >>>Hi! As we know utf8::encode() does not provide correct UTF-8 encoding
> >>>and Encode::encode("UTF-8", ...) should be used instead. Also opening
> >>>file should be done by :encoding(UTF-8) layer instead :utf8.
> >>>
> >>>But UTF-8 strict implementation in Encode module is horrible slow when
> >>>comparing to utf8::encode(). It is implemented in Encode.xs file and for
> >>>benchmarking can be this XS implementation called directly by:
> >>>
> >>> use Encode;
> >>> my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input)
> >>>
> >>>(without overhead of Encode module...)
> >>>
> >>>Here are my results on 160 bytes long input string:
> >>>
> >>> Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  8 wallclock secs ( 
> >>> 8.56 usr +
> >0.00 sys =  8.56 CPU) @ 467289.72/s (n=4000000)
> >>> Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 
> >>> 1.66 usr +
> >0.00 sys =  1.66 CPU) @ 2409638.55/s (n=4000000)
> >>> utf8::encode:  1 wallclock secs ( 0.39 usr +  0.00 sys =  0.39 CPU) @
> >10256410.26/s (n=4000000)
> >>>
> >>>I found two bottle necks (slow sv_catpv* and utf8n_to_uvuni functions)
> >>>and did some optimizations. Final results are:
> >>>
> >>> Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  2 wallclock secs ( 
> >>> 3.27 usr +
> >0.00 sys =  3.27 CPU) @ 1223241.59/s (n=4000000)
> >>> Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 
> >>> 1.68 usr +
> >0.00 sys =  1.68 CPU) @ 2380952.38/s (n=4000000)
> >>> utf8::encode:  1 wallclock secs ( 0.40 usr +  0.00 sys =  0.40 CPU) @
> >10000000.00/s (n=4000000)
> >>>
> >>>Patches are on github at pull request:
> >>>https://github.com/dankogai/p5-encode/pull/56
> >>>
> >>>I would like if somebody review my patches and tell if this is the
> >>>right way for optimizations...
> >>>
> >>
> >>I'm sorry that this slipped off my radar until I saw it in the new Encode
> >>release
> >>
> >>There are a couple of things I see wrong with your patch.
> >>
> >>1) It does not catch the malformation of an overlong sequence.  This is a
> >>serious malformation which has been used for attacks.  Basically, after you
> >>get the result, you need to check that it is the expected length for that
> >>result.  For example, \xC2\x80 will have an input length of 2, and evaluates
> >>to \x00, whose expected length is 1, and so the input is overlong.  In
> >>modern perls, you can just do an OFFUNISKIP(uv) and compare that with the
> >>passed-in length.  This can be rewritten for perls back to 5.8 using
> >>UNI_SKIP and UNI_TO_NATIVE
> >
> >I do not see where can be a problem. At least I think my patches should
> >be compatible with previous implementation of Encode.xs...
> >
> >First UTF8_IS_INVARIANT is checked and one character processed.
> >
> >Otherwise UTF8_IS_START is checked and UTF8SKIP is used to get length of
> >sequence. And then len-1 characters are checked if they pass test for
> >UTF8_IS_CONTINUATION.
> >
> >If there are less characters then following does not
> >UTF8_IS_CONTINUATION and error is reported. If there are more, then next
> >iteration of loop starts and it fail on both UTF8_IS_CONTINUATION and
> >UTF8_IS_START.
> >
> >Can you describe in details what do you think it wrong and how to do
> >that attack?
> 
> This discussion has been active at
> https://github.com/dankogai/p5-encode/issues/64
> 
> For the curious out there, please refer to that discussion.  My bottom line
> is that I have come to believe the security risks are too high to have
> modules do their own security checking for UTF-8 correctness.
> >
> >>2) It does not work on EBCDIC platforms.  The NATIVE_TO_UTF() call is a good
> >>start, but the result uv needs to be transformed back to native, using
> >>UNI_TO_NATIVE(uv).
> >
> >uv is used just to check if it is valid Unicode code point. Real value
> >is used only for error/warn message. Previous implementation used
> >utf8n_to_uvuni which convert return value with NATIVE_TO_UNI.
> 
> As I said on that other thread, if this is really true, then it's faster to
> use a boolean function to verify the inputs.


Value of uv is used only in warn/error message.

> Also, performance should not
> be a consideration for errors or warnings.  One can check validity fast; and
> then spend the time getting the message right in the rare cases where a
> message is generated.

Yes, fully I agree.

> >>3) The assumptions the subroutine runs under need to be documented for
> >>future maintainers and code readers.  For example, it assumes that there is
> >>enough space in the input to hold all the bytes.
> >
> >Function process_utf8 does not assume that. It calls SvGROW to increase
> >buffer size when needed.
> 
> You misunderstand what I meant here.  The bottom line is your patch adds a
> significant amount of code without any comments in a risky area.  The name
> of the function does not indicate that its value is to be thrown away, and
> even after looking at the code that calls it some more, it's not obvious to
> me that the value isn't kept.  All subtleties in code should be commented in
> that code.  To do otherwise is a disservice to future maintainers.  I
> personally will never push to blead someone's commit that I think  unfairly
> burdens future maintainers.  One of the subtleties of this function is that
> it doesn't check that it is not running off the end of the buffer.  It
> relies on the caller to do that check, but someone coming along might see
> that function and think from its name that it's more general purpose than it
> actually is.  Someone looking at its name and return value would likely
> think it generates a valid code point from UTF-8 input.

It is intended to do not add another non-needed checks as it again slow
down function performance. Checks are done before on another place and
that function is designed to be used only in Encode's process_utf8.

So rather write comments/description about that function. But on the
other hand, I dislike comments which just write what is function doing
in case that comments are longer then function code itself. Then it is
easier to read function code itself and so whole comment is useless...

> >>Other than that, it looks ok to me.  But, to be sure, I think you should run
> >>it on the tests included in the core t/op/utf8decode.t which came from an
> >>internet repository of edge cases.
> 
> I later realized that under non-strict calls, this can overflow, and there
> is some code in your amendments to this patch that check that.  I have not
> looked those over.

Yes, check for overlong and overflow is there...

> But again, I don't think Encode should undertake its own security checking.
> I'm willing to work with you to get something in core that adequately meets
> Encode's needs.

Ok. We can discuss about it. First I see there big problems:

1) Encode module is for Perl 5.8+. In 5.8+ perl's versions will never be
   your (new) functions, so Encode module needs to have at least copy of
   them. And this does not solve problem which you want to prevent :-(

2) Functions must be defined and declared in some header file. So C
   compiler can inline and optimize them in Encode module. Calling such
   hot function via shared library must be avoided.

3) Such function needs to be designed in way that it do only what is
   needed in case for Encode module. On the other hand I do not think it
   is good idea to create special function just for Encode module living
   in core perl... But maybe general-useful function can be designed.

> >How to use and run that test with Encode?
> 
> It looks like you figured that out for the most part in your amended
> patches.

Yes, meanwhile you wrote reply, I figured out about those problems and
also copied those tests :-) Anyway, test for EBCDIC are missing in core.

Re: Encode UTF-8 optimizations

Reply via email to