Re: [PHP-DEV] Make strtolower/strtoupper just do ASCII

2021-09-22 Thread Tim Starling
On 19/9/21 12:33 am, tyson andre wrote:
> When implementing this, Zend/Optimizer/sccp.c has optimizations for functions 
> such as str_contains, etc to optimize.
> After removing locale dependence, those optimizations could be safely added 
> for functions that would be locale independent as a result of your change.
> - This would allow eliminating more dead code, and make code calling those 
> functions (on constant arguments) faster by caching the resulting strings in 
> opcache.

I couldn't make this work. Even after setting
opcache.optimization_level to 0x7FBF (pass 6 will not run unless
pass 7 is disabled), zend_dfa_optimize_op_array() is called with
call_map=NULL, so ct_eval_func_call() is never entered. I'll leave
this change for someone who is able to test it (or for someone braver
than me).
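
To spell out the bitmask arithmetic behind that setting, here is a minimal
sketch. It assumes optimizer pass N is controlled by bit N-1 of
opcache.optimization_level, as the ZEND_OPTIMIZER_PASS_N macros in php-src
define it. The helper names are made up for illustration.

```c
/* Assumption: pass N of the optimizer corresponds to bit N-1 of
 * opcache.optimization_level, mirroring ZEND_OPTIMIZER_PASS_N. */
unsigned pass_bit(unsigned n) { return 1u << (n - 1); }

/* Hypothetical helper: is pass n enabled in the given level? */
int pass_enabled(unsigned level, unsigned n) {
    return (level & pass_bit(n)) != 0;
}

/* Hypothetical helper: clear pass n from a level. */
unsigned disable_pass(unsigned level, unsigned n) {
    return level & ~pass_bit(n);
}
```

Starting from a level with passes 1 to 15 all set (0x7FFF),
disable_pass(0x7FFF, 7) gives 0x7FBF, which still has pass 6 enabled.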

-- Tim Starling

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: https://www.php.net/unsub.php



Re: [PHP-DEV] Make strtolower/strtoupper just do ASCII

2021-09-22 Thread Tim Starling
On 19/9/21 12:33 am, tyson andre wrote:
> When implementing this, Zend/Optimizer/sccp.c has optimizations for functions 
> such as str_contains, etc to optimize.
> After removing locale dependence, those optimizations could be safely added 
> for functions that would be locale independent as a result of your change.
> - This would allow eliminating more dead code, and make code calling those 
> functions (on constant arguments) faster by caching the resulting strings in 
> opcache.

Thanks, I will do that.

> The function `zend_string_tolower` can safely be used to efficiently convert 
> strings to lowercase in a case-insensitive way.
> (zend_string_toupper hasn't been needed yet due to not yet having any use 
> cases in php-src's internals, but could be added in such a PR)

I uploaded my work so far and made a PR. It already has
zend_string_toupper.

-- Tim Starling




Re: [PHP-DEV] Make strtolower/strtoupper just do ASCII

2021-09-18 Thread tyson andre
Hi Tim Starling,
 
> I would like to know if a patch to make strtolower and strtoupper do
> plain ASCII case conversion would be accepted, or if an RFC should be
> created.
> 
> The situation with case conversion is inconsistent.
> 
> The following functions do ASCII case conversion: strcasecmp,
> strncasecmp, substr_compare.
> 
> The following functions do locale-dependent case conversion:
> strtolower, strtoupper, str_ireplace, stristr, stripos, strripos,
> strnatcasecmp, ucfirst, ucwords, lcfirst.
> 
> I would make them all do ASCII case conversion.
> 
> Developers need ASCII case conversion, because it is used internally
> by PHP for things like class name comparison, and because it is a
> specified algorithm in HTML 5 and related standards.
> 
> The existing options for ASCII case conversion are:
> 
> * Never call setlocale(). But this breaks non-ASCII characters in
> escapeshellarg() and can't be guaranteed in a library.
> 
> * Call setlocale(LC_ALL, "C.UTF-8"). But this is non-portable and also
> can't be guaranteed in a library.
> 
> * Use strtr(). But this is ugly and slow.
> 
> If mbstring has a way to do it, I can't find it. I tested
> mb_strtolower($s, '8bit') and mb_strtolower($s,'ascii').
> 
> Note that locale-dependent case conversion is almost never a useful
> feature. Strings are passed through tolower() one byte at a time, to
> be interpreted with some legacy 8-bit character set. So the result
> will typically be mojibake even if the correct locale is selected.
> 
> strtolower() mangles UTF-8 strings in many locales, such as fr-FR. I
> made a full list at . The
> UTF-8 locales mostly work, except for the Turkish ones, which mangle
> ASCII strings.
> 
> At https://bugs.php.net/bug.php?id=67815 , Nikita Popov wrote: "My
> general recommendation is to avoid locales and locale-dependent
> functions, as locales are a fundamentally broken concept." I agree
> with that. I think PHP should migrate away from locale dependence.
> When PHP was young, it was convenient to use the C library, but we've
> progressed well past that point now.

I think it's a good idea (but it would still require an RFC).
As you said, acting on bytes rather than codepoints is almost always 
incorrect outside a narrow range
(except for rare charsets such as https://en.wikipedia.org/wiki/ISO/IEC_8859-1).

The behavior of strtolower is inconvenient for common uses such as
- filesystem paths, where strtolower('I') isn't 'i' in tr_TR
- username validation, if it's possible to create a new account whose name is 
treated as the same case-insensitive string in some locales but not others
- etc.
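
As an illustration of the kind of comparison those use cases want, here is a
minimal sketch in C of a case-insensitive compare that never consults the
locale. ascii_tolower and ascii_strcasecmp are hypothetical names for this
example, not functions from php-src or the C library.

```c
/* Lowercase one byte using ASCII rules only; never consults the locale. */
unsigned char ascii_tolower(unsigned char c) {
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Hypothetical helper: compare two NUL-terminated strings, ignoring ASCII
 * case only. Returns 0 when equal, otherwise <0 or >0 like strcmp(). */
int ascii_strcasecmp(const char *a, const char *b) {
    for (;; a++, b++) {
        unsigned char ca = ascii_tolower((unsigned char)*a);
        unsigned char cb = ascii_tolower((unsigned char)*b);
        if (ca != cb) return (int)ca - (int)cb;
        if (ca == '\0') return 0;
    }
}
```

With this, "INDEX.PHP" and "index.php" compare equal whatever locale the
process has set, while bytes outside A-Z are compared as-is.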

When implementing this, note that Zend/Optimizer/sccp.c already has 
optimizations for functions such as str_contains.
After removing locale dependence, those optimizations could be safely added for 
functions that would be locale independent as a result of your change.
- This would allow eliminating more dead code, and make code calling those 
functions (on constant arguments) faster by caching the resulting strings in 
opcache.

The function `zend_string_tolower` can safely be used to efficiently convert 
strings to lowercase in a locale-independent way.
(zend_string_toupper hasn't been needed yet due to not yet having any use cases 
in php-src's internals, but could be added in such a PR)

```
|| zend_string_equals_literal(name, "str_contains")
|| zend_string_equals_literal(name, "str_ends_with")
|| zend_string_equals_literal(name, "str_replace")
|| zend_string_equals_literal(name, "str_split")
|| zend_string_equals_literal(name, "str_starts_with")
```

Thanks,
Tyson



Re: [PHP-DEV] Make strtolower/strtoupper just do ASCII

2021-09-17 Thread Nikita Popov
On Fri, Sep 17, 2021 at 12:07 PM Tim Starling wrote:

> On 17/9/21 7:15 pm, Kamil Tekiela wrote:
> > +1 from me. I wasn't even aware that these functions are
> > locale-dependent until recently. I see an added benefit that we could
> > add them to the optimizer once they are no longer locale-dependent.
> > What would happen to users who really need the locale-dependent
> > functions? Do we offer some workarounds?
>
> We could add a global mode, although that would prevent constant
> propagation, if that's what you mean by adding them to the optimizer.
> Or we could add variant functions like locale_strtolower() and
> locale_strtoupper(). But I think I would want to hear from someone who
> uses locale-dependence so I can understand what their needs are. I
> guess the RFC will sort that out.
>

I would expect that in nearly all cases the replacement would be one of
these:
1. If you were using a UTF-8 locale (which you likely are), then just keep
using strtolower(). Without having checked all the details here, I think
strtolower() under UTF-8 locales already effectively behaves like ASCII
lowercase, because it skips continuation bytes.
2. If you were using some other charset, then using mb_strtolower() with
that charset should work. So if you were using de_DE.ISO8859-1, then using
mb_strtolower() with "ISO8859-1" encoding would be the replacement.
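
The UTF-8 case can be illustrated with a minimal sketch in C, assuming a
plain ASCII lowercase loop (ascii_strtolower is a hypothetical name, not the
php-src implementation). Every byte of a multi-byte UTF-8 sequence is 0x80
or above, so a loop that only touches A-Z passes those bytes through
unchanged.

```c
#include <stddef.h>

/* Lowercase ASCII letters in place; bytes >= 0x80, which include every
 * byte of a multi-byte UTF-8 sequence, are left alone. */
void ascii_strtolower(char *s, size_t len) {
    for (size_t i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (c >= 'A' && c <= 'Z') {
            s[i] = (char)(c + ('a' - 'A'));
        }
    }
}
```

Applied to the UTF-8 bytes of "CAFÉ", this yields "cafÉ": the string stays
valid UTF-8, and the non-ASCII letter is simply not case-converted.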

As a matter of general policy, it is unlikely that we will accept an option
(whether that be an ini option or something else) to control this behavior.
We can make the change or not make it, but not both ;)

Regards,
Nikita


Re: [PHP-DEV] Make strtolower/strtoupper just do ASCII

2021-09-17 Thread Pierre Joye
Hi Tim,

hope you are well :)


On Fri, Sep 17, 2021, 5:07 PM Tim Starling wrote:

>
> We could add a global mode, although that would prevent constant
> propagation, if that's what you mean by adding them to the optimizer.
> Or we could add variant functions like locale_strtolower() and
> locale_strtoupper(). But I think I would want to hear from someone who
> uses locale-dependence so I can understand what their needs are. I
> guess the RFC will sort that out.
>


May I suggest a function rather than an ini setting?

It has the advantage of being contextual and allows the user to enable/disable
it before calling some library API they may not be able to (or don't want
to) patch.

str_use_locale(bool), for example?

At some point it can be false by default and later on removed.


best,
Pierre



Re: [PHP-DEV] Make strtolower/strtoupper just do ASCII

2021-09-17 Thread Christian Schneider
On 17.09.2021 at 10:43, Nikita Popov wrote:
> The locale-sensitivity of strtolower() only works with legacy
> single-byte encodings and as such is of questionable usefulness even in
> cases where it is not actively harmful.
> 
> That said, I do think this change requires an RFC.

I agree that this is a big enough BC break to require an RFC, and I'd 
recommend a phase in which strtolower, when used with locales where it *did* 
do something useful, shows a deprecation warning to allow migration away from 
strtolower in those cases.

- Chris



Re: [PHP-DEV] Make strtolower/strtoupper just do ASCII

2021-09-17 Thread Tim Starling
On 17/9/21 6:43 pm, Nikita Popov wrote:
> We've been slowly moving away from locale-dependent functionality.
> Since PHP 8 we no longer inherit any locales from the environment and
> have made float to string conversion locale-independent.
> 
> I would very much support making strtolower() and friends a simple
> ASCII case conversion operation. mb_strtolower() etc already offer
> full Unicode-compliant case conversions that work correctly with
> multi-byte encodings. The locale-sensitivity of strtolower() only
> works with legacy single-byte encodings and as such is of questionable
> usefulness even in cases where it is not actively harmful.
> 
> That said, I do think this change requires an RFC.

Thanks Nikita. I'll write the code and then make an RFC.

-- Tim Starling




Re: [PHP-DEV] Make strtolower/strtoupper just do ASCII

2021-09-17 Thread Tim Starling
On 17/9/21 7:15 pm, Kamil Tekiela wrote:
> +1 from me. I wasn't even aware that these functions are
> locale-dependent until recently. I see an added benefit that we could
> add them to the optimizer once they are no longer locale-dependent. 
> What would happen to users who really need the locale-dependent
> functions? Do we offer some workarounds?

We could add a global mode, although that would prevent constant
propagation, if that's what you mean by adding them to the optimizer.
Or we could add variant functions like locale_strtolower() and
locale_strtoupper(). But I think I would want to hear from someone who
uses locale-dependence so I can understand what their needs are. I
guess the RFC will sort that out.

-- Tim Starling




Re: [PHP-DEV] Make strtolower/strtoupper just do ASCII

2021-09-17 Thread Kamil Tekiela
+1 from me. I wasn't even aware that these functions are locale-dependent
until recently. I see an added benefit that we could add them to the
optimizer once they are no longer locale-dependent.
What would happen to users who really need the locale-dependent functions?
Do we offer some workarounds?


Re: [PHP-DEV] Make strtolower/strtoupper just do ASCII

2021-09-17 Thread Nikita Popov
On Fri, Sep 17, 2021 at 4:59 AM Tim Starling wrote:

> I would like to know if a patch to make strtolower and strtoupper do
> plain ASCII case conversion would be accepted, or if an RFC should be
> created.
>
> The situation with case conversion is inconsistent.
>
> The following functions do ASCII case conversion: strcasecmp,
> strncasecmp, substr_compare.
>
> The following functions do locale-dependent case conversion:
> strtolower, strtoupper, str_ireplace, stristr, stripos, strripos,
> strnatcasecmp, ucfirst, ucwords, lcfirst.
>
> I would make them all do ASCII case conversion.
>
> Developers need ASCII case conversion, because it is used internally
> by PHP for things like class name comparison, and because it is a
> specified algorithm in HTML 5 and related standards.
>
> The existing options for ASCII case conversion are:
>
> * Never call setlocale(). But this breaks non-ASCII characters in
> escapeshellarg() and can't be guaranteed in a library.
>
> * Call setlocale(LC_ALL, "C.UTF-8"). But this is non-portable and also
> can't be guaranteed in a library.
>
> * Use strtr(). But this is ugly and slow.
>
> If mbstring has a way to do it, I can't find it. I tested
> mb_strtolower($s, '8bit') and mb_strtolower($s,'ascii').
>
> Note that locale-dependent case conversion is almost never a useful
> feature. Strings are passed through tolower() one byte at a time, to
> be interpreted with some legacy 8-bit character set. So the result
> will typically be mojibake even if the correct locale is selected.
>
> strtolower() mangles UTF-8 strings in many locales, such as fr-FR. I
> made a full list at . The
> UTF-8 locales mostly work, except for the Turkish ones, which mangle
> ASCII strings.
>
> At https://bugs.php.net/bug.php?id=67815 , Nikita Popov wrote: "My
> general recommendation is to avoid locales and locale-dependent
> functions, as locales are a fundamentally broken concept." I agree
> with that. I think PHP should migrate away from locale dependence.
> When PHP was young, it was convenient to use the C library, but we've
> progressed well past that point now.
>
> -- Tim Starling
>

We've been slowly moving away from locale-dependent functionality. Since
PHP 8 we no longer inherit any locales from the environment and have made
float to string conversion locale-independent.

I would very much support making strtolower() and friends a simple ASCII
case conversion operation. mb_strtolower() etc already offer full
Unicode-compliant case conversions that work correctly with multi-byte
encodings. The locale-sensitivity of strtolower() only works with legacy
single-byte encodings and as such is of questionable usefulness even in
cases where it is not actively harmful.

That said, I do think this change requires an RFC.

Regards,
Nikita


[PHP-DEV] Make strtolower/strtoupper just do ASCII

2021-09-16 Thread Tim Starling
I would like to know if a patch to make strtolower and strtoupper do
plain ASCII case conversion would be accepted, or if an RFC should be
created.

The situation with case conversion is inconsistent.

The following functions do ASCII case conversion: strcasecmp,
strncasecmp, substr_compare.

The following functions do locale-dependent case conversion:
strtolower, strtoupper, str_ireplace, stristr, stripos, strripos,
strnatcasecmp, ucfirst, ucwords, lcfirst.

I would make them all do ASCII case conversion.

Developers need ASCII case conversion, because it is used internally
by PHP for things like class name comparison, and because it is a
specified algorithm in HTML 5 and related standards.

The existing options for ASCII case conversion are:

* Never call setlocale(). But this breaks non-ASCII characters in
escapeshellarg() and can't be guaranteed in a library.

* Call setlocale(LC_ALL, "C.UTF-8"). But this is non-portable and also
can't be guaranteed in a library.

* Use strtr(). But this is ugly and slow.

If mbstring has a way to do it, I can't find it. I tested
mb_strtolower($s, '8bit') and mb_strtolower($s,'ascii').

Note that locale-dependent case conversion is almost never a useful
feature. Strings are passed through tolower() one byte at a time, to
be interpreted with some legacy 8-bit character set. So the result
will typically be mojibake even if the correct locale is selected.
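
To make the byte-at-a-time problem concrete, here is a minimal sketch in C.
It assumes a hand-written table standing in for tolower() under an
ISO-8859-1 locale (the real table comes from the C library), and
latin1_tolower and latin1_strtolower are illustrative names only.

```c
#include <stddef.h>

/* Stand-in for tolower() under an ISO-8859-1 locale (an assumption for
 * illustration): A-Z, plus 0xC0-0xDE except 0xD7 (the multiplication
 * sign), map to byte + 0x20. */
unsigned char latin1_tolower(unsigned char c) {
    if (c >= 'A' && c <= 'Z') return (unsigned char)(c + 0x20);
    if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return (unsigned char)(c + 0x20);
    return c;
}

/* Apply the mapping one byte at a time, with no knowledge of the
 * encoding the caller actually used. */
void latin1_strtolower(unsigned char *s, size_t len) {
    for (size_t i = 0; i < len; i++) {
        s[i] = latin1_tolower(s[i]);
    }
}
```

Given the UTF-8 bytes of "É" (0xC3 0x89), this changes 0xC3 to 0xE3 and
leaves 0x89 alone, so the result is not the UTF-8 encoding of "é"
(0xC3 0xA9) and no longer decodes as the intended character.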

strtolower() mangles UTF-8 strings in many locales, such as fr-FR. I
made a full list at . The
UTF-8 locales mostly work, except for the Turkish ones, which mangle
ASCII strings.

At https://bugs.php.net/bug.php?id=67815 , Nikita Popov wrote: "My
general recommendation is to avoid locales and locale-dependent
functions, as locales are a fundamentally broken concept." I agree
with that. I think PHP should migrate away from locale dependence.
When PHP was young, it was convenient to use the C library, but we've
progressed well past that point now.

-- Tim Starling
