Re: [PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function

2024-03-28 Thread Casper Langemeijer


> So... if you want to help make people more aware of the grapheme_* 
> functions, one place to start would be editing the documentation for the 
> various string, mbstring, and grapheme functions to use consistent 
> terminology, and sign-post each other more clearly. 
> http://doc.php.net/tutorial/

Yes, I agree. I've also edited documentation before, back in the SVN days, and 
I already planned to read up on how this works nowadays.

I'm also planning an outline for a conference talk on the subject. I've 
educated people on Unicode-related subjects before, and I think I have a few 
very good stories that can give unsuspecting developers insight into this.

I love an analogy that most Europeans understand. For the city of Cologne, 
there are two equally valid ways to write its German name: Köln and Koeln 
(the latter used when hindered by technical limitations, or maybe in informal 
conversation). Every German can extra_e_decode() and extra_e_encode(). The same 
goes for Straße and Strasse.

Ligatures in fonts make it harder though; sometimes they intentionally 
obfuscate what's happening in the Unicode layer. You might know this from 
special programming fonts with glyphs for ===, <> and such.

Some Dutch fonts have a special ligature that combines ij, which was in the 
Dutch alphabet when I was a kid; 'y' was not. Unicode U+0132 and U+0133 
describe this symbol, but I've never seen them used. Fonts combining ij into 
one visual entity are more common.

I imagine most languages and cultures have these kinds of edge cases.



Re: [PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function

2024-03-28 Thread Derick Rethans
On 27 March 2024 23:18:21 GMT, "Rowan Tommins [IMSoP]" wrote:
>On 26/03/2024 21:14, Casper Langemeijer wrote:
>> If you need someone to help for the grapheme_ marketing team, let me know.
>
>I think a big part of the problem is that very few people dig into the 
>complexities of text encoding, and so don't know that a "grapheme" is what 
>they're looking for.
>
>Unicode documentation is, generally, very careful with its terminology - 
>distinguishing between "code points", "code units", "graphemes", "grapheme 
>clusters", "glyphs", etc. Pretty much everyone else just says "character", and 
>assumes that everyone knows what they mean.

That's why I have been working on 


Takes all the (or most) terminology out of it. 

It's time to resurrect it. 

cheers 
Derick 


Re: [PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function

2024-03-27 Thread Rowan Tommins [IMSoP]

On 26/03/2024 21:14, Casper Langemeijer wrote:

> If you need someone to help for the grapheme_ marketing team, let me know.


I think a big part of the problem is that very few people dig into the 
complexities of text encoding, and so don't know that a "grapheme" is 
what they're looking for.


Unicode documentation is, generally, very careful with its terminology - 
distinguishing between "code points", "code units", "graphemes", 
"grapheme clusters", "glyphs", etc. Pretty much everyone else just says 
"character", and assumes that everyone knows what they mean.



As a case in point, looking at the PHP manual pages for strlen, 
mb_strlen, and grapheme_strlen:


Short summary:

- strlen — Get string length
- mb_strlen — Get string length
- grapheme_strlen — Get string length in grapheme units

Description:

- Returns the length of the given string.
- Gets the length of a string.
- Get string length in grapheme units (not bytes or characters)


The first two don't actually say what units they're measuring in. Maybe 
it's millimetres? ;)


The last one uses the term "grapheme" without explaining what it means, 
and makes a contrast with "characters", which is confusing, as one of 
the definitions in the Unicode glossary 
[https://unicode.org/glossary/#grapheme] is:


> What a user thinks of as a character.


The mb_strlen documentation has a bit more explanation in its Return 
Values section:


> Returns the number of characters in string string having character 
> encoding encoding. A multi-byte character is counted as 1.


For Unicode in particular, this is a poor description; it is completely 
missing the term "code point", which is what it actually counts.
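
To make the difference in units concrete, here is a minimal sketch that 
measures one string with all three functions. It assumes the mbstring and intl 
extensions are loaded and that the input is UTF-8; the helper name 
length_in_units() is made up for this example and is not part of PHP.

```php
<?php
// Hypothetical helper (not a PHP built-in): length of a UTF-8 string in three units.
function length_in_units(string $s): array
{
    return [
        'bytes'       => strlen($s),              // raw bytes
        'code_points' => mb_strlen($s, 'UTF-8'),  // Unicode code points
        'graphemes'   => grapheme_strlen($s),     // grapheme clusters (intl / ICU)
    ];
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT is displayed as a single "é".
print_r(length_in_units("e\u{0301}"));
// bytes => 3, code_points => 2, graphemes => 1
```

Only the last number matches what a user would call "one character" here.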


That's probably because ext/mbstring wasn't written with Unicode in 
mind, it was "developed to handle Japanese characters", back in 2001; 
and it still does support several pre-Unicode "multi-byte encodings". 
For a bit of nostalgia: 
http://web.archive.org/web/20010605075550/http://www.php.net/manual/en/ref.mbstring.php



So... if you want to help make people more aware of the grapheme_* 
functions, one place to start would be editing the documentation for the 
various string, mbstring, and grapheme functions to use consistent 
terminology, and sign-post each other more clearly. 
http://doc.php.net/tutorial/



Regards,

--
Rowan Tommins
[IMSoP]


Re: [PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function

2024-03-26 Thread youkidearitai

Hi, Casper

I think the mbstring functions are still useful, because they remain valid
as a bridge to non-Unicode character encodings.
We think it makes sense for mbstring to count in Unicode code point units.

Therefore, I think it makes sense to keep the mbstring functions and the
grapheme functions separate.
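
As a rough illustration of the "bridge" role, the sketch below converts a 
string to Shift_JIS and measures it there. It assumes the mbstring extension 
is loaded and that this source file is saved as UTF-8; the variable names are 
only for illustration.

```php
<?php
$utf8 = "日本語";  // three characters, written here as UTF-8

// Convert to a pre-Unicode multi-byte encoding.
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');

echo strlen($sjis), "\n";             // 6 (two bytes per character in Shift_JIS)
echo mb_strlen($sjis, 'SJIS'), "\n";  // 3

// The round trip back to UTF-8 gives the original string.
var_dump(mb_convert_encoding($sjis, 'UTF-8', 'SJIS') === $utf8);  // bool(true)
```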

Regards
Yuya

-- 
---
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-


Re: [PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function

2024-03-26 Thread Casper Langemeijer
On Tue, Mar 26, 2024, at 18:15, Derick Rethans wrote:
> Many of these already exist, such as grapheme_substr. We can't simply change 
> the behaviour of the already existing functions due to BC reasons. 

Wow. I feel very stupid. I feel I should have known about grapheme_*, but I 
didn't. Oh my, the manual says since PHP 5.3 no less. From what I've seen 
being used around me, I'm far from the only one though. In an attempt to 
justify my own stupidity I searched for its usage, and it's bad.

Searching on GitHub with language:PHP:
`mb_strlen`  84k files, `grapheme_strlen` 680

And a large share of the first 100 of those files are stubs/polyfills/PHPStan 
metadata. I've seen no framework except Symfony (but others might be further 
down in the search results).

> The grapheme_str_split function, as well as other intl extension functions is 
> what should replace mbstring really. 

YES!

I'm sorry to have wasted your time. If you need someone to help for the 
grapheme_ marketing team, let me know.

Re: [PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function

2024-03-26 Thread Derick Rethans
On 26 March 2024 17:04:18 GMT, Casper Langemeijer  wrote:
>I'd like to address an issue I have with this RFC.

Please don't top reply. 

>I'm not sure it solves a problem by itself. If I understand all of this 
>correctly this only does what already can be accomplished with 
>preg_match_all('/\X/u', ...). The result of this method in my opinion is not 
>very useful by itself. I've done some searching on various code platforms 
>where I mostly find the use-case for counting the number of graphemes. I've 
>used it to implement a strrev() that works correctly on multibyte strings. 
>
>I'm very sad that mbstring works on codepoints instead of graphemes and I 
>would very much like to see something happening in that area, but I think 
>expanding a simple string to an array of as many elements to give developers a 
>tool to do this in PHP-space is not good enough. Especially since it can 
>already be achieved with a regexp that already works.
>
>In my opinion: This adds nothing, and tells the PHP developer that it is ok to 
>do count(grapheme_str_split()) for a more accurate mb_strlen().
>
>I would like to see a family of functions that can do multibyte str_split(), 
>strrev(), substr(). Ideally as bugfix in mb_* functions, because the edge-case 
>of wanting to know the length in codepoints of a string is a weird edge-case. 
>No developer wants to know that. mb_strlen() should have returned the number 
>of graphemes from the start.

Many of these already exist, such as grapheme_substr. We can't simply change 
the behaviour of the already existing functions due to BC reasons. 

The intl extension is also built on ICU, an actual Unicode text processing 
library. 

The grapheme_str_split function, as well as other intl extension functions is 
what should replace mbstring really. 
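
For example, a small sketch of how grapheme_substr differs from mb_substr on 
the same input, assuming the intl and mbstring extensions are loaded and the 
string is UTF-8 (variable names are illustrative only):

```php
<?php
// "é" written as "e" + U+0301 COMBINING ACUTE ACCENT, followed by "a".
$s = "e\u{0301}a";

// mb_substr counts code points, so it cuts the accent off its base letter.
$byCodePoint = mb_substr($s, 0, 1, 'UTF-8');

// grapheme_substr counts grapheme clusters, so the accent stays attached.
$byGrapheme = grapheme_substr($s, 0, 1);

var_dump($byCodePoint === "e");          // bool(true)
var_dump($byGrapheme === "e\u{0301}");   // bool(true)
```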

cheers 
Derick 


Re: [PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function

2024-03-26 Thread Casper Langemeijer
I'd like to address an issue I have with this RFC.

I'm not sure it solves a problem by itself. If I understand all of this 
correctly this only does what already can be accomplished with 
preg_match_all('/\X/u', ...). The result of this method in my opinion is not 
very useful by itself. I've done some searching on various code platforms 
where I mostly find the use-case for counting the number of graphemes. I've 
used it to implement a strrev() that works correctly on multibyte strings. 
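
A minimal sketch of that regexp approach, for reference. The helper names 
split_graphemes() and reverse_graphemes() are invented for this example; it 
assumes PCRE with Unicode support (the /u modifier) and UTF-8 input.

```php
<?php
// Hypothetical helper: split a UTF-8 string into grapheme clusters using PCRE's \X.
function split_graphemes(string $s): array
{
    if (preg_match_all('/\X/u', $s, $matches) === false) {
        // preg_match_all() returns false on failure, e.g. for malformed UTF-8.
        throw new InvalidArgumentException('String could not be split (invalid UTF-8?)');
    }
    return $matches[0];
}

// Hypothetical helper: a strrev() that keeps each grapheme cluster intact.
function reverse_graphemes(string $s): string
{
    return implode('', array_reverse(split_graphemes($s)));
}

$word = "noe\u{0301}l";                    // "noël" with a combining accent
echo count(split_graphemes($word)), "\n";  // 4
echo reverse_graphemes($word), "\n";       // the accent stays on the "e"
```

Plain strrev() would reverse the bytes instead and break the two-byte 
combining accent.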

I'm very sad that mbstring works on codepoints instead of graphemes and I 
would very much like to see something happening in that area, but I think 
expanding a simple string to an array of as many elements to give developers a 
tool to do this in PHP-space is not good enough. Especially since it can 
already be achieved with a regexp that already works.

In my opinion: This adds nothing, and tells the PHP developer that it is ok to 
do count(grapheme_str_split()) for a more accurate mb_strlen().

I would like to see a family of functions that can do multibyte str_split(), 
strrev(), substr(). Ideally as bugfix in mb_* functions, because the edge-case 
of wanting to know the length in codepoints of a string is a weird edge-case. 
No developer wants to know that. mb_strlen() should have returned the number of 
graphemes from the start.




Re: [PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function

2024-03-25 Thread youkidearitai

Hi, Internals

grapheme_str_split is going to the "Voting" phase.
The vote ends on 10th April at 00:00 GMT.

Regards
Yuya

-- 
---
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-


Re: [PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function

2024-03-25 Thread David CARLIER
I second this, I think it is a good addition which makes a lot of sense.

Cheers.



Re: [PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function

2024-03-25 Thread Ayesh Karunaratne

I think it makes sense to add this function, and the PR worked well
too; it correctly split individual graphemes for all complex emojis,
ZWJs, and those Cthulhu texts, and everything else I threw at it.

Good luck for the RFC vote today, hope it passes 🤞.


Re: [PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function

2024-03-24 Thread youkidearitai

Hello, Internals

I would like to move to the "Voting" phase if there are no further comments.
I will start the "Voting" phase tomorrow (26th).

Thank you
Yuya

-- 
---
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-


[PHP-DEV][RFC] grapheme cluster for str_split, grapheme_str_split function

2024-03-08 Thread youkidearitai
Hello, Internals

I created a wiki page for the `grapheme_str_split` function.
Please see:
https://wiki.php.net/rfc/grapheme_str_split

I would like to move it to the "Under Discussion" section.

Best Regards
Yuya

-- 
---
Yuya Hamada (tekimen)
- https://tekitoh-memdhoi.info
- https://github.com/youkidearitai
-