Re: [PHP-DEV] Inconsistency mbstring functions

youkidearitai Mon, 04 Dec 2023 06:05:55 -0800

On Mon, Dec 4, 2023 at 22:25, Robert Landers <[email protected]> wrote:
>
> On Mon, Dec 4, 2023 at 1:51 PM Stefan Schiller via internals
> <[email protected]> wrote:
> >
> > On Sat, Dec 2, 2023 at 6:13 AM Alex <[email protected]> wrote:
> > >
> > > Dear Stefan, and Dear Gina,
> > >
> > > Thanks for the message. Yes, Stefan has rediscovered an interesting quirk 
> > > of the mbstring library. I have been aware of this for a long time, and 
> > > other mbstring developers have too. It dates back to the origin of the 
> > > library; actually, even before the origin of mbstring, since mbstring was 
> > > based on another library called libmbfl, and this behavior originates 
> > > from libmbfl.
> > >
> > > Pull your chair up around the fire and let me tell you the tale of 
> > > libmbfl. Once upon a time, there was a text-processing library called 
> > > libmbfl. libmbfl was based on a collection of text-decoding routines 
> > > (which converted bytes to codepoints) and text-encoding routines (which 
> > > converted codepoints to bytes). Each such routine was structured as a 
> > > stateful "filter". These filters could be assembled into "chains", 
> > > whereby the output values generated by one routine would automatically be 
> > > passed to the next. libmbfl could perform many wonderful text-processing 
> > > tasks by substituting a different final filter at the end of the chain.
> > >
> > > But all was not well. Since libmbfl's filters processed text only one 
> > > byte or codepoint at a time, and each routine had to save its state 
> > > before returning, and restore its state upon entry, libmbfl was slow. 
> > > Slow as a turtle, slow as a snail, slow as 
> > > whatever-slowly-moving-thing-you-can-think-of. Oh, what was libmbfl to 
> > > do? A clever plan was hatched: give libmbfl a 256-entry table called a 
> > > "mblen_table" for each supported text encoding with the property that the 
> > > byte length of a character can be determined from its first byte. Then, 
> > > text-processing tasks which were not dependent on the actual content of a 
> > > string, but only on the number of codepoints, could be performed without 
> > > ever invoking those wonderful, but painfully slow filters! libmbfl could 
> > > skip through a string while just examining the first byte of each 
> > > character. (Of course, this only worked for text encodings with an 
> > > mblen_table.) For valid strings, the new method worked identically to the 
> > > previous one. For invalid strings, there were significant differences in 
> > > behavior, but libmbfl tried to ignore these and bravely pressed on.
> > >
> > > The story ends with an ironic twist. Many years later, I became 
> > > interested in mbstring and reimplemented its internals, replacing the 
> > > libmbfl code with fresh new code which ran many times faster. The new 
> > > code was so much faster that in some cases, the mblen_table optimization 
> > > actually became a pessimization! In other cases, the mblen_table-based 
> > > code is still faster, but not by a large amount. But now mbstring was 
> > > haunted by the spectre of Hyrum's Law (https://www.hyrumslaw.com/). With 
> > > a huge body of legacy code relying on mbstring, almost any observable 
> > > behavior change runs a significant risk of breaking someone's code. And 
> > > when this happens, they will not hesitate to vent their rage on the 
> > > hapless maintainers.
> > >
> > > Notwithstanding the rage of the users, about a year ago, I did remove the 
> > > mblen_table-based code in one place where benchmarks clearly showed it 
> > > was acting as a pessimization. I don't remember which mbstring function 
> > > was affected and would need to check the commit log to confirm.
> >
> > Hi Alex,
> >
> > Thank you very much for sharing this background context.
> >
> > >
> > > Personally, I think the real issue here is not the inconsistency between 
> > > mbstring functions which are based on the mblen_tables and those which 
> > > are not. I think a lot of mbstring operations should not be used on 
> > > invalid strings at all, and that for such operations, mbstring would do 
> > > well to throw an exception if it receives invalid input. (Like mb_strpos; 
> > > how do you define the "position of a UTF-8 substring" when the parent 
> > > string is not UTF-8 at all?) But that would be a huge BC break.
> > >
> >
> > My biggest concern is that this quirk can cause security issues in
> > user code. I came across this in the first place when discovering an
> > exploitable security vulnerability in an application. From my point of
> > view, this is not only about inconsistent behavior but also violates
> > the documentation for specific functions like mb_strstr. I agree that
> > a lot of mbstring operations should not be used on invalid strings,
> > and an exception seems to be an appropriate answer despite the huge BC
> > impact.
>
> I think it is only a security issue when people mistakenly assume
> mb_* functions should be used whenever they are available. I've seen people do
> mb_strlen() on binary data, for example, not realizing the differences
> between mb_strlen and strlen. Or using mb_* functions and then passing
> them off to cryptographic functions.
>
> Robert Landers
> Software Engineer
> Utrecht NL
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: https://www.php.net/unsub.php
>


Hi, Internals.

Sorry if this is off topic.
I don't know if this will be helpful, but as Japanese mbstring users, we
call mb_check_encoding before we use these mb_* functions.
If the character encoding is invalid, we raise an error.

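```
<?php

$string = "\xE3\x81\x82"; // example input: "あ" encoded as UTF-8
$encoding = 'UTF-8';

if (mb_check_encoding($string, $encoding) !== true) {
    // throw an exception (or exit with a non-zero status)
    throw new InvalidArgumentException("Invalid $encoding string");
}

$length = mb_strlen($string, $encoding);
// do something
```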
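
The same check can be wrapped in a small helper so that every call site
gets it. This is only a sketch: `strict_mb_strlen` is a hypothetical name,
not an mbstring function, and throwing InvalidArgumentException is my
assumption about how the error should surface.

```php
<?php

// Hypothetical helper (not part of mbstring): validate first, then measure.
function strict_mb_strlen(string $string, string $encoding = 'UTF-8'): int
{
    if (!mb_check_encoding($string, $encoding)) {
        throw new InvalidArgumentException("Input is not valid $encoding");
    }
    return mb_strlen($string, $encoding);
}

$valid = "\xE3\x81\x82\xE3\x81\x84\xE3\x81\x86"; // "あいう" in UTF-8
$truncated = "\xE3\x81";                          // cut off in the middle of a character

// 3 characters, although strlen($valid) reports 9 bytes
$validLength = strict_mb_strlen($valid);

try {
    strict_mb_strlen($truncated);
    $rejected = false;
} catch (InvalidArgumentException $e) {
    $rejected = true; // invalid input never reaches mb_strlen
}
```

With this, the truncated string is rejected before any length is computed,
so the caller never has to guess what a "length" of invalid data means.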

Anyway, I agree that mbstring should throw an exception when the encoding
is invalid, because not everyone who uses it is familiar with character
encodings. However, this is a BC break, so I think it needs more discussion.
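
To show why invalid input matters, here is a rough sketch of the
first-byte skipping that Alex described. It is not the real mblen_table
code: both function names are hypothetical, and treating an unexpected
lead byte as length 1 is my assumption.

```php
<?php

// Hypothetical: byte length of a UTF-8 character, judged only by its first byte.
function utf8_len_from_first_byte(int $byte): int
{
    if ($byte < 0x80) {
        return 1; // ASCII
    }
    if ($byte >= 0xC2 && $byte <= 0xDF) {
        return 2;
    }
    if ($byte >= 0xE0 && $byte <= 0xEF) {
        return 3;
    }
    if ($byte >= 0xF0 && $byte <= 0xF4) {
        return 4;
    }
    return 1; // assumption: stray continuation or invalid lead byte counts as one unit
}

// Hypothetical: count characters by hopping from lead byte to lead byte,
// never looking at the bytes in between.
function count_by_first_byte(string $s): int
{
    $count = 0;
    $i = 0;
    $n = strlen($s);
    while ($i < $n) {
        $i += utf8_len_from_first_byte(ord($s[$i]));
        $count++;
    }
    return $count;
}

$validCount = count_by_first_byte("a\xE3\x81\x82b"); // "a", "あ", "b"
$brokenCount = count_by_first_byte("\xE3ab");         // lead byte followed by ASCII
```

For the valid string the sketch counts 3, the same as decoding every
character. For "\xE3ab" it counts 1, because the lead byte claims three
bytes and "a" and "b" are skipped as if they belonged to it.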

Regards
Yuya

-- 
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------