Re: [PHP-DEV] Suggestion: Make all PCRE functions return *character* offsets, rather than *byte* offsets if the modifier `u` (PCRE_UTF8) is given

2020-10-02 Thread Claude Pache
Hi,

Working with UTF-8-encoded strings does not imply working with mb_string
functions or with code-point counts. Personally, I work with the standard string
functions, plus the [Grapheme functions](https://www.php.net/manual/en/ref.intl.grapheme.php)
when I need to split my string into “characters” (which for me means “grapheme
clusters”, not “code points”, so the mb_string functions are useless to me). In
particular, PREG_OFFSET_CAPTURE always does what I need, even when using the /u
flag.
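
To illustrate the point above, here is a minimal sketch of byte offsets from PREG_OFFSET_CAPTURE being fed straight into the standard (byte-based) string functions. The sample string is made up for illustration, and the sketch assumes the script itself is saved as UTF-8.

```php
<?php
// Assumption: this file and $subject are UTF-8 encoded.
$subject = "naïve café";

// With /u the pattern is matched as UTF-8, but the offset is still in bytes.
preg_match('/café/u', $subject, $m, PREG_OFFSET_CAPTURE);
[$match, $byteOffset] = $m[0];

// Byte offsets plug directly into the byte-based string functions.
$before = substr($subject, 0, $byteOffset);
$found  = substr($subject, $byteOffset, strlen($match));

var_dump($byteOffset);       // int(7): "naïve " is 7 bytes, because "ï" takes 2
var_dump($found === $match); // bool(true)
```

Because both the offset and `substr()` count bytes, no conversion step is needed in this style of code.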

If this is a feature that you want to implement, I suggest adding a flag 
PREG_UTF8_CODEPOINT_OFFSET_CAPTURE.

—Claude





[PHP-DEV] Suggestion: Make all PCRE functions return *character* offsets, rather than *byte* offsets if the modifier `u` (PCRE_UTF8) is given

2020-10-02 Thread Thomas Landauer
Hi,

This is a follow-up to a bug I opened; cmb suggested continuing the
discussion here: https://bugs.php.net/bug.php?id=80166

Advantages:

1: Easier string manipulation:
If somebody does (as in my case) `preg_match_all()` with
PREG_OFFSET_CAPTURE, what will they probably use those returned
numbers/offsets for?
My answer: For *splitting the string* - in one way or another. Now,
with byte offsets, I can't do such basic things as just `+1` to get to
the next character, or extract exactly 3 characters.

2: Better performance:
This may sound odd, since cmb said the exact opposite ;-) (sequential
access vs. random access). However, if I need character offsets (see 1),
what can I do? I'm forced to add some workaround on top, such as
https://www.php.net/manual/en/function.preg-match-all.php#71572 - which
is certainly far slower than any native implementation.

3: Consistency with users' expectations:
The current behavior is causing confusion and is perceived as
counter-intuitive, see
https://www.php.net/manual/en/function.preg-match-all.php#61426 and
https://stackoverflow.com/questions/1725227/preg-match-and-utf-8-in-php
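
To make points 1 and 2 concrete, here is a minimal sketch of the kind of userland workaround meant above. `byte_to_char_offset()` is a hypothetical helper written for this example, not an existing PHP function and not necessarily what the linked manual note does. It assumes valid UTF-8 input and the mbstring extension.

```php
<?php
// Hypothetical helper (not part of PHP): convert a byte offset into a
// code point offset by measuring the prefix that precedes it.
function byte_to_char_offset(string $subject, int $byteOffset): int
{
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

// Assumption: this file and $subject are UTF-8 encoded.
$subject = "äöü abc";

preg_match('/abc/u', $subject, $m, PREG_OFFSET_CAPTURE);
$byteOffset = $m[0][1];
$charOffset = byte_to_char_offset($subject, $byteOffset);

var_dump($byteOffset); // int(7): three 2-byte letters plus one space
var_dump($charOffset); // int(4)

// Point 1: "+1" on a byte offset can land inside a multi-byte character.
var_dump(mb_check_encoding(substr($subject, 1), 'UTF-8')); // bool(false)

// With a code point offset, "the next 3 characters" is straightforward.
var_dump(mb_substr($subject, $charOffset, 3, 'UTF-8')); // string(3) "abc"
```

Regarding point 2: this helper rescans the prefix on every call, so converting many offsets in a long string repeats a lot of work.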

So I'm suggesting:

* Either do the BC break, and just return character offsets if the modifier
`u` is given.
* Or create *new* functions for it: `mb_preg_match_all()` etc.

--

Cheers,
Thomas

-- 
PHP Internals - PHP Runtime Development Mailing List