Re: Encode UTF-8 optimizations

Karl Williamson Thu, 18 Aug 2016 22:07:55 -0700

On 08/12/2016 09:31 AM, [email protected] wrote:

On Thursday 11 August 2016 17:41:23 Karl Williamson wrote:

On 07/09/2016 05:12 PM, [email protected] wrote:

Hi! As we know, utf8::encode() does not provide correct UTF-8 encoding
and Encode::encode("UTF-8", ...) should be used instead. Also, files
should be opened with the :encoding(UTF-8) layer instead of :utf8.


But the strict UTF-8 implementation in the Encode module is horribly slow
compared to utf8::encode(). It is implemented in the Encode.xs file, and for
benchmarking this XS implementation can be called directly with:

 use Encode;
 my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input);

(without overhead of Encode module...)

Here are my results on a 160-byte input string:

 Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  8 wallclock secs ( 8.56 usr +  0.00 sys =  8.56 CPU) @ 467289.72/s (n=4000000)
 Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.66 usr +  0.00 sys =  1.66 CPU) @ 2409638.55/s (n=4000000)
 utf8::encode:  1 wallclock secs ( 0.39 usr +  0.00 sys =  0.39 CPU) @ 10256410.26/s (n=4000000)


I found two bottlenecks (the slow sv_catpv* and utf8n_to_uvuni functions)
and did some optimizations. The final results are:

 Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  2 wallclock secs ( 3.27 usr +  0.00 sys =  3.27 CPU) @ 1223241.59/s (n=4000000)
 Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.68 usr +  0.00 sys =  1.68 CPU) @ 2380952.38/s (n=4000000)
 utf8::encode:  1 wallclock secs ( 0.40 usr +  0.00 sys =  0.40 CPU) @ 10000000.00/s (n=4000000)


Patches are on GitHub in this pull request:
https://github.com/dankogai/p5-encode/pull/56

I would like somebody to review my patches and tell me if this is the
right way to do these optimizations...


I'm sorry that this slipped off my radar until I saw it in the new Encode
release.

There are a couple of things I see wrong with your patch.

1) It does not catch the malformation of an overlong sequence.  This is a
serious malformation which has been used for attacks.  Basically, after you
get the result, you need to check that it is the expected length for that
result.  For example, \xC0\x80 will have an input length of 2, and evaluates
to \x00, whose expected length is 1, and so the input is overlong.  In
modern perls, you can just do an OFFUNISKIP(uv) and compare that with the
passed-in length.  This can be rewritten for perls back to 5.8 using
UNI_SKIP and UNI_TO_NATIVE.
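
A minimal standalone C sketch of the length comparison described above.
It does not use Perl's headers: shortest_utf8_len() is a hypothetical
stand-in for what OFFUNISKIP(uv) provides, and is_overlong() is likewise
an illustrative name, not a function from Encode.xs or the perl core.  It
assumes ASCII platforms and a sequence of 1 to 4 bytes whose start and
continuation bytes were already checked by the caller.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the shortest UTF-8 encoding length of a code
 * point (the role OFFUNISKIP(uv) plays in the text above). */
static size_t shortest_utf8_len(uint32_t cp)
{
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Decode one sequence and report whether it is overlong.
 * Precondition (caller's job): 1 <= len <= 4, s[0..len-1] is readable,
 * s[0] is a start byte for a len-byte sequence and the rest are
 * continuation bytes. */
static int is_overlong(const unsigned char *s, size_t len)
{
    static const unsigned char lead_mask[5] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
    uint32_t cp = s[0] & lead_mask[len];
    size_t i;

    for (i = 1; i < len; i++)
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);

    /* Overlong: the value would have fit in fewer bytes than were used. */
    return shortest_utf8_len(cp) < len;
}
```

With this sketch, the two bytes C0 80 decode to 0, whose shortest length
is 1, so the input is flagged; C2 80 decodes to 0x80, whose shortest
length is 2, so it is not.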


I do not see where the problem could be. At least I think my patches should
be compatible with the previous implementation of Encode.xs...

First UTF8_IS_INVARIANT is checked and one character processed.

Otherwise UTF8_IS_START is checked and UTF8SKIP is used to get the length
of the sequence. Then the next len-1 characters are checked to see whether
they pass the UTF8_IS_CONTINUATION test.

If there are fewer characters, the following one does not pass
UTF8_IS_CONTINUATION and an error is reported. If there are more, the next
iteration of the loop starts and fails both UTF8_IS_CONTINUATION and
UTF8_IS_START.
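
The loop described above can be sketched in standalone C to show what a
purely structural check accepts.  The byte-class helpers below are
simplified stand-ins chosen for this illustration (ASCII platforms only);
they are assumptions, not the definitions of Perl's UTF8_IS_INVARIANT,
UTF8_IS_START, UTF8SKIP or UTF8_IS_CONTINUATION, and structure_ok() is a
hypothetical name.

```c
#include <stddef.h>

/* Simplified stand-ins for the byte classes named in the text. */
static int is_invariant(unsigned char b)    { return b < 0x80; }
static int is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }
static int is_start(unsigned char b)        { return b >= 0xC2 && b <= 0xF4; }

/* Sequence length implied by a start byte (only called on start bytes). */
static size_t seq_len(unsigned char b)
{
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    return 4;
}

/* Returns 1 if the buffer consists only of invariants and of start bytes
 * followed by the right number of continuation bytes.  The decoded value
 * is never examined. */
static int structure_ok(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        size_t len, k;

        if (is_invariant(s[i])) { i++; continue; }
        if (!is_start(s[i])) return 0;      /* stray continuation, etc. */

        len = seq_len(s[i]);
        if (len > n - i) return 0;          /* sequence is truncated */
        for (k = 1; k < len; k++)
            if (!is_continuation(s[i + k])) return 0;
        i += len;
    }
    return 1;
}
```

Under these assumptions the three bytes E0 80 80 pass the check even
though they are an overlong encoding of U+0000, because nothing compares
the decoded value against the number of bytes used.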

Can you describe in detail what you think is wrong and how that attack
would be carried out?


This discussion has been active at
https://github.com/dankogai/p5-encode/issues/64

For the curious out there, please refer to that discussion. My bottom
line is that I have come to believe the security risks are too high to
have modules do their own security checking for UTF-8 correctness.

2) It does not work on EBCDIC platforms.  The NATIVE_TO_UTF() call is a good
start, but the result uv needs to be transformed back to native, using
UNI_TO_NATIVE(uv).


uv is used just to check whether it is a valid Unicode code point. The real
value is used only for the error/warning message. The previous
implementation used utf8n_to_uvuni, which converts the return value with
NATIVE_TO_UNI.

As I said on that other thread, if this is really true, then it's faster
to use a boolean function to verify the inputs. Also, performance
should not be a consideration for errors or warnings. One can check
validity fast; and then spend the time getting the message right in the
rare cases where a message is generated.
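
A minimal standalone C sketch of that split: a yes/no validity check on
the common path, and a separate routine that builds a message only after
the check has failed.  All names here (valid_seq_len, utf8_is_valid,
utf8_describe_error) are hypothetical, and the rules are the
well-formedness ranges from the Unicode Standard on ASCII platforms,
which is an assumption and not necessarily the exact set of checks
Encode or the perl core applies.

```c
#include <stddef.h>
#include <stdio.h>

static int is_cont(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Length (1..4) of the well-formed sequence at s, or 0 if malformed.
 * Precondition: avail >= 1 and s[0..avail-1] is readable. */
static size_t valid_seq_len(const unsigned char *s, size_t avail)
{
    unsigned char b = s[0], lo, hi;

    if (b < 0x80) return 1;
    if (b >= 0xC2 && b <= 0xDF)
        return (avail >= 2 && is_cont(s[1])) ? 2 : 0;
    if (b >= 0xE0 && b <= 0xEF) {
        lo = (b == 0xE0) ? 0xA0 : 0x80;     /* rejects overlongs  */
        hi = (b == 0xED) ? 0x9F : 0xBF;     /* rejects surrogates */
        return (avail >= 3 && s[1] >= lo && s[1] <= hi
                && is_cont(s[2])) ? 3 : 0;
    }
    if (b >= 0xF0 && b <= 0xF4) {
        lo = (b == 0xF0) ? 0x90 : 0x80;     /* rejects overlongs  */
        hi = (b == 0xF4) ? 0x8F : 0xBF;     /* rejects > U+10FFFF */
        return (avail >= 4 && s[1] >= lo && s[1] <= hi
                && is_cont(s[2]) && is_cont(s[3])) ? 4 : 0;
    }
    return 0;                               /* C0, C1, F5..FF, stray 80..BF */
}

/* Common path: boolean answer only, no decoding, no message. */
static int utf8_is_valid(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        size_t len = valid_seq_len(s + i, n - i);
        if (len == 0) return 0;
        i += len;
    }
    return 1;
}

/* Rare path: call only after utf8_is_valid() returned 0.  Writes a
 * message for the first bad byte and returns its offset (n if none). */
static size_t utf8_describe_error(const unsigned char *s, size_t n,
                                  char *msg, size_t msg_size)
{
    size_t i = 0;

    while (i < n) {
        size_t len = valid_seq_len(s + i, n - i);
        if (len == 0) {
            snprintf(msg, msg_size,
                     "malformed UTF-8 at byte offset %lu (0x%02X)",
                     (unsigned long)i, (unsigned)s[i]);
            return i;
        }
        i += len;
    }
    if (msg_size > 0) msg[0] = '\0';
    return n;
}
```

A caller would run utf8_is_valid() on every input and fall back to
utf8_describe_error() only when it returns 0, so the cost of formatting
the message is paid only for malformed input.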

3) The assumptions the subroutine runs under need to be documented for
future maintainers and code readers.  For example, it assumes that there is
enough space in the input to hold all the bytes.


Function process_utf8 does not assume that. It calls SvGROW to increase
buffer size when needed.

You misunderstand what I meant here. The bottom line is your patch adds
a significant amount of code without any comments in a risky area. The
name of the function does not indicate that its value is to be thrown
away, and even after looking at the code that calls it some more, it's
not obvious to me that the value isn't kept. All subtleties in code
should be commented in that code. To do otherwise is a disservice to
future maintainers. I personally will never push to blead someone's
commit that I think unfairly burdens future maintainers. One of the
subtleties of this function is that it doesn't check that it is not
running off the end of the buffer. It relies on the caller to do that
check, but someone coming along might see that function and think from
its name that it's more general purpose than it actually is. Someone
looking at its name and return value would likely think it generates a
valid code point from UTF-8 input.

Other than that, it looks ok to me.  But, to be sure, I think you should run
it on the tests included in the core t/op/utf8decode.t which came from an
internet repository of edge cases.

I later realized that under non-strict calls, this can overflow, and
there is some code in your amendments to this patch that checks that. I
have not looked those over.

But again, I don't think Encode should undertake its own security
checking. I'm willing to work with you to get something in core that
adequately meets Encode's needs.


How do I use and run that test with Encode?

It looks like you figured that out for the most part in your amended
patches.
