2023年12月4日(月) 22:25 Robert Landers <[email protected]>:
>
> On Mon, Dec 4, 2023 at 1:51 PM Stefan Schiller via internals
> <[email protected]> wrote:
> >
> > On Sat, Dec 2, 2023 at 6:13 AM Alex <[email protected]> wrote:
> > >
> > > Dear Stefan, and Dear Gina,
> > >
> > > Thanks for the message. Yes, Stefan has rediscovered an interesting quirk
> > > of the mbstring library. I have been aware of this for a long time, and
> > > other mbstring developers have too. It dates back to the origin of the
> > > library; actually, even before the origin of mbstring, since mbstring was
> > > based on another library called libmbfl, and this behavior originates
> > > from libmbfl.
> > >
> > > Pull your chair up around the fire and let me tell you the tale of
> > > libmbfl. Once upon a time, there was a text-processing library called
> > > libmbfl. libmbfl was based on a collection of text-decoding routines
> > > (which converted bytes to codepoints) and text-encoding routines (which
> > > converted codepoints to bytes). Each such routine was structured as a
> > > stateful "filter". These filters could be assembled into "chains",
> > > whereby the output values generated by one routine would automatically be
> > > passed to the next. libmbfl could perform many wonderful text-processing
> > > tasks by substituting a different final filter at the end of the chain.
> > >
> > > But all was not well. Since libmbfl's filters processed text only one
> > > byte or codepoint at a time, and each routine had to save its state
> > > before returning, and restore its state upon entry, libmbfl was slow.
> > > Slow as a turtle, slow as a snail, slow as
> > > whatever-slowly-moving-thing-you-can-think-of. Oh, what was libmbfl to
> > > do? A clever plan was hatched: give libmbfl a 256-entry table called a
> > > "mblen_table" for each supported text encoding with the property that the
> > > byte length of a character can be determined from its first byte. Then,
> > > text-processing tasks which were not dependent on the actual content of a
> > > string, but only on the number of codepoints, could be performed without
> > > ever invoking those wonderful, but painfully slow filters! libmbfl could
> > > skip through a string while just examining the first byte of each
> > > character. (Of course, this only worked for text encodings with an
> > > mblen_table.) For valid strings, the new method worked identically to the
> > > previous one. For invalid strings, there were significant differences in
> > > behavior, but libmbfl tried to ignore these and bravely pressed on.
> > >
> > > The story ends with an ironic twist. Many years later, I became
> > > interested in mbstring and reimplemented its internals, replacing the
> > > libmbfl code with fresh new code which ran many times faster. The new
> > > code was so much faster that in some cases, the mblen_table optimization
> > > actually became a pessimization! In other cases, the mblen_table-based
> > > code is still faster, but not by a large amount. But now mbstring was
> > > haunted by the spectre of Hyrum's Law (https://www.hyrumslaw.com/). With
> > > a huge body of legacy code relying on mbstring, almost any observable
> > > behavior change runs a significant risk of breaking someone's code. And
> > > when this happens, they will not hesitate to vent their rage on the
> > > hapless maintainers.
> > >
> > > Notwithstanding the rage of the users, about a year ago, I did remove the
> > > mblen_table-based code in one place where benchmarks clearly showed it
> > > was acting as a pessimization. I don't remember which mbstring function
> > > was affected and would need to check the commit log to confirm.
> >
> > Hi Alex,
> >
> > Thank you very much for sharing this background context.
> >
> > >
> > > Personally, I think the real issue here is not the inconsistency between
> > > mbstring functions which are based on the mblen_tables and those which
> > > are not. I think a lot of mbstring operations should not be used on
> > > invalid strings at all, and that for such operations, mbstring would do
> > > well to throw an exception if it receives invalid input. (Like mb_strpos;
> > > how do you define the "position of a UTF-8 substring" when the parent
> > > string is not UTF-8 at all?) But that would be a huge BC break.
> > >
> >
> > My biggest concern is that this quirk can cause security issues in
> > user code. I came across this in the first place when discovering an
> > exploitable security vulnerability in an application. From my point of
> > view, this is not only about inconsistent behavior but also violates
> > the documentation for specific functions like mb_strstr. I agree that
> > a lot of mbstring operations should not be used on invalid strings,
> > and an exception seems to be an appropriate answer despite the huge BC
> > impact.
>
> I think it is only a security issue when people accidentally think
> mb_* functions should be used if it is available. I've seen people do
> mb_strlen() on binary data, for example, not realizing the differences
> between mb_strlen and strlen. Or using mb_* functions and then passing
> them off to cryptographic functions.
>
> Robert Landers
> Software Engineer
> Utrecht NL
>
> --
> PHP Internals - PHP Runtime Development Mailing List
> To unsubscribe, visit: https://www.php.net/unsub.php
>
Hi, Internals.
Sorry if I'm off topic.
I don't know if it will be helpful, Japanese mbstring user if use
these mb_* functions, we use mb_check_encoding.
If character encoding is invalid, then occur error.
```
<?php
if (mb_check_encoding($string, $encoding) !== true)) {
// throw Exception or exit(not zero)
}
mb_strlen($string);
// do something
```
Anyway, I agree if invalid encoding then mbstring is raise exception.
Because not only people who are familiar with character codes.
However, this is BC break, I think need more discussion.
Regards
Yuya
--
---------------------------
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-----------------------------
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php