Re: [PHP-DEV] [RFC] UString
Hi. Am 02.07.15 um 15:43 schrieb Ivan Enderlin@Hoa: Hello :-), Just a small detail. Please, choose another name. The `Hoa\String` https://packagist.org/packages/hoa/string library has been renamed to `Hoa\Ustring` because of PHP7. So, please, don't force us to rename the library again ;-). What's the issue with the name? As far as I see it, There's no problem at all, as there's UString and then there's Hoa\UString. Different namespace, no issue. Or am I missing something? Cheers Andreas Moreover, this library provides an API that is useful for daily use and can be inspiring. Please, see http://hoa-project.net/Literature/Hack/Ustring.html. Regards. On 01/07/15 01:30, Sara Golemon wrote: On Mon, Mar 2, 2015 at 12:48 AM, Nikita Popov nikita@gmail.com wrote: On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org wrote: https://wiki.php.net/rfc/ustring This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever. Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;) Curious what the current state of the UString RFC is. I've got a functionality request for HHVM to wrap icu::UnicodeString and was hoping to match PHP behavior if any plans had been made, and lo... here's a plan! I'm not totally convinced by this proposal. We already have quite a number of extensions that deal with unicode text in one way or another (at least intl, mbstring and iconv). This adds yet another way of dealing with this issue - a way that will have to be combined with at least two other extensions (mbstring or iconv for input handling and conversion) and intl for any non-trivial operations. There's nothing wrong with adding another approach for unicode handling per se, but I'd like to have more empahsis on how this integrates with existing functionality and why it is implemented separately from it (especially intl), etc. 
I think (hope) that Joe's intention was to make it as an extension for proof of concept, but make it part of the intl extension when it comes to full adoption by the runtime. If not, let's talk about making that the intent, because intl is where this belongs. For my bikeshedding part, I'd recommend against the u() function helper as it pollutes the global function namespace and takes a very fundamental name. intl\u() might be worth considering, but we'll need to have a discussion about namespacing for the intl extension as a whole (separate topic). I'd also recommend IntlString rather than UString as nearly all the Intl classes follow this convention. The one notable exception being UConverter (which yes, I added, and I regret the departure in naming). Otherwise, while there's room to quibble about specific API names and arguments, the general concept seems a no-brainer. -Sara -- Andreas Heigl | mailto:andr...@heigl.org | N 50°22'59.5 E 08°23'58 | http://andreas.heigl.org | http://hei.gl/wiFKy7 | http://hei.gl/root-ca
Re: [PHP-DEV] [RFC] UString
I fear it will be a reserved keyword. On 02/07/15 15:46, Andreas Heigl wrote: Hi. Am 02.07.15 um 15:43 schrieb Ivan Enderlin@Hoa: Hello :-), Just a small detail. Please, choose another name. The `Hoa\String` https://packagist.org/packages/hoa/string library has been renamed to `Hoa\Ustring` because of PHP7. So, please, don't force us to rename the library again ;-). What's the issue with the name? As far as I see it, There's no problem at all, as there's UString and then there's Hoa\UString. Different namespace, no issue. Or am I missing something? Cheers Andreas Moreover, this library provides an API that is useful for daily use and can be inspiring. Please, see http://hoa-project.net/Literature/Hack/Ustring.html. Regards. On 01/07/15 01:30, Sara Golemon wrote: On Mon, Mar 2, 2015 at 12:48 AM, Nikita Popov nikita@gmail.com wrote: On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org wrote: https://wiki.php.net/rfc/ustring This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever. Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;) Curious what the current state of the UString RFC is. I've got a functionality request for HHVM to wrap icu::UnicodeString and was hoping to match PHP behavior if any plans had been made, and lo... here's a plan! I'm not totally convinced by this proposal. We already have quite a number of extensions that deal with unicode text in one way or another (at least intl, mbstring and iconv). This adds yet another way of dealing with this issue - a way that will have to be combined with at least two other extensions (mbstring or iconv for input handling and conversion) and intl for any non-trivial operations. 
There's nothing wrong with adding another approach for unicode handling per se, but I'd like to have more emphasis on how this integrates with existing functionality and why it is implemented separately from it (especially intl), etc. I think (hope) that Joe's intention was to make it as an extension for proof of concept, but make it part of the intl extension when it comes to full adoption by the runtime. If not, let's talk about making that the intent, because intl is where this belongs. For my bikeshedding part, I'd recommend against the u() function helper as it pollutes the global function namespace and takes a very fundamental name. intl\u() might be worth considering, but we'll need to have a discussion about namespacing for the intl extension as a whole (separate topic). I'd also recommend IntlString rather than UString as nearly all the Intl classes follow this convention. The one notable exception being UConverter (which yes, I added, and I regret the departure in naming). Otherwise, while there's room to quibble about specific API names and arguments, the general concept seems a no-brainer. -Sara -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP-DEV] [RFC] UString
Hello :-), Just a small detail. Please, choose another name. The `Hoa\String` https://packagist.org/packages/hoa/string library has been renamed to `Hoa\Ustring` because of PHP7. So, please, don't force us to rename the library again ;-). Moreover, this library provides an API that is useful for daily use and can be inspiring. Please, see http://hoa-project.net/Literature/Hack/Ustring.html. Regards. On 01/07/15 01:30, Sara Golemon wrote: On Mon, Mar 2, 2015 at 12:48 AM, Nikita Popov nikita@gmail.com wrote: On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org wrote: https://wiki.php.net/rfc/ustring This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever. Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;) Curious what the current state of the UString RFC is. I've got a functionality request for HHVM to wrap icu::UnicodeString and was hoping to match PHP behavior if any plans had been made, and lo... here's a plan! I'm not totally convinced by this proposal. We already have quite a number of extensions that deal with unicode text in one way or another (at least intl, mbstring and iconv). This adds yet another way of dealing with this issue - a way that will have to be combined with at least two other extensions (mbstring or iconv for input handling and conversion) and intl for any non-trivial operations. There's nothing wrong with adding another approach for unicode handling per se, but I'd like to have more empahsis on how this integrates with existing functionality and why it is implemented separately from it (especially intl), etc. I think (hope) that Joe's intention was to make it as an extension for proof of concept, but make it part of the intl extension when it comes to full adoption by the runtime. If not, let's talk about making that the intent, because intl is where this belongs. 
For my bikeshedding part, I'd recommend against the u() function helper as it pollutes the global function namespace and takes a very fundamental name. intl\u() might be worth considering, but we'll need to have a discussion about namespacing for the intl extension as a whole (separate topic). I'd also recommend IntlString rather than UString as nearly all the Intl classes follow this convention. The one notable exception being UConverter (which yes, I added, and I regret the departure in naming). Otherwise, while there's room to quibble about specific API names and arguments, the general concept seems a no-brainer. -Sara
Re: [PHP-DEV] [RFC] UString
Hi Ivan

2015-07-02 15:48 GMT+02:00 Ivan Enderlin@Hoa ivan.ender...@hoa-project.net:
> I fear it will be a reserved keyword.

Internally defined classes, such as UConverter or stdClass, are not reserved keywords; they are not an actual part of the language but a part of the library. Code like the one below is perfectly valid, meaning the example you made will continue to work as long as it remains within a namespace:

    C:\dev\php-src>php -r "namespace stdlib; class stdclass { } var_dump(get_class(new stdclass), get_class(new \stdClass));"
    string(15) "stdlib\stdclass"
    string(8) "stdClass"

--
regards,
Kalle Sommer Nielsen
ka...@php.net
Re: [PHP-DEV] [RFC] UString
On Thu, Jul 2, 2015 at 6:43 AM, Ivan Enderlin@Hoa ivan.ender...@hoa-project.net wrote:
> Just a small detail. Please, choose another name. The `Hoa\String` https://packagist.org/packages/hoa/string library has been renamed to `Hoa\Ustring` because of PHP7. So, please, don't force us to rename the library again ;-).

As others have replied, there is no need for concern on that front: \UString and Hoa\UString can live side by side.

However, I would like to bump my earlier suggestion to go with IntlString and make this functionality part of the intl extension.

> I'd also recommend IntlString rather than UString as nearly all the Intl classes follow this convention. The one notable exception being UConverter (which yes, I added, and I regret the departure in naming).

-Sara
Re: [PHP-DEV] [RFC] UString
On Tue, Jun 30, 2015 at 10:36 PM, Joe Watkins pthre...@pthreads.org wrote:
> Another possible issue is engine integration:
>
>     $string = (UString) $someString;
>     $string = (UString) "someString";

That sounds like a cool idea to discuss as a completely separate, unrelated RFC, and not specific to UString. e.g.

    $obj = (ClassName)$arg;
    /* turns into */
    $obj = new ClassName($arg);

So you could use casting with any class which supports single-argument constructors. But again, orthogonal to this RFC.

-Sara
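As a rough illustration of the rewrite Sara describes, the sketch below emulates `(ClassName)$arg` with an ordinary function. The helper name `cast_to` is hypothetical and is not part of PHP or of the RFC; PHP has no class-cast syntax, so this only shows what the proposed desugaring would amount to.

```php
<?php
// Hypothetical helper: emulates the idea that (ClassName)$arg would be
// rewritten to new ClassName($arg). Nothing here is engine behaviour.
function cast_to(string $class, $arg)
{
    // Works for any class whose constructor accepts a single argument.
    return new $class($arg);
}

$date = cast_to(DateTime::class, '2015-07-01 13:06:00');
echo $date->format('Y-m-d'), "\n"; // 2015-07-01

$list = cast_to(ArrayObject::class, [1, 2, 3]);
echo count($list), "\n"; // 3
```

The point of the sketch is that such a cast adds no capability beyond `new`; it would only be alternative syntax.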
Re: [PHP-DEV] [RFC] UString
On Jul 1, 2015, at 1:06 PM, Sara Golemon poll...@php.net wrote:
> On Tue, Jun 30, 2015 at 10:36 PM, Joe Watkins pthre...@pthreads.org wrote:
>> Another possible issue is engine integration:
>>
>>     $string = (UString) $someString;
>>     $string = (UString) "someString";
>
> That sounds as a cool idea to discuss as a completely separate, unrelated RFC, and not specific to UString. e.g.
>
>     $obj = (ClassName)$arg;
>     /* turns into */
>     $obj = new ClassName($arg);
>
> So you could use casting with any class which supports single-argument constructors. But again, orthogonal to this RFC.
>
> -Sara

Expanding on this idea, a separate RFC could propose a magic __cast($value) static method that would be called for code like below:

    $obj = (ClassName) $scalarOrObject; // Invokes ClassName::__cast($scalarOrObject);

This would allow UString to implement casting a string to a UString and allow users to implement such behavior with their own classes.

However, I would not implement such casting syntax for UString only. Being able to write $ustring = (UString) $string; without the ability to do so for other classes would be unusual and confusing in my opinion. If an RFC adding such behavior were implemented, UString could be updated to support casting.

Obviously a UString should be able to be cast to a scalar string using (string) $ustring. If performance is a concern, UString::__toString() should cache the result so that multiple casts of the same object are quick.

Aaron Piotrowski
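The caching suggestion for `__toString()` can be sketched in userland as follows. The class `CachedText`, its array-of-pieces representation, and the `conversions` counter are all assumptions made for illustration; none of this is how the UString extension stores or converts its data.

```php
<?php
// Hypothetical stand-in, not the proposed UString class.
final class CachedText
{
    private $pieces;          // pretend internal representation
    private $cached = null;   // memoized scalar string
    public $conversions = 0;  // counts how often the real conversion ran

    public function __construct(array $pieces)
    {
        $this->pieces = $pieces;
    }

    public function __toString(): string
    {
        if ($this->cached === null) {
            // Stand-in for the expensive conversion step.
            $this->conversions++;
            $this->cached = implode('', $this->pieces);
        }
        return $this->cached;
    }
}

$text = new CachedText(['Hello', ', ', 'world']);
$first  = (string) $text;
$second = (string) $text;

echo $first, "\n";             // Hello, world
echo $text->conversions, "\n"; // 1
```

A cache like this is only safe if the object is immutable, or if every mutating method resets `$cached`.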
RE: [PHP-DEV] [RFC] UString
Hi, -Original Message- From: Aaron Piotrowski [mailto:aa...@icicle.io] Sent: Wednesday, July 1, 2015 9:00 PM To: Sara Golemon Cc: pthre...@pthreads.org; internals@lists.php.net Subject: Re: [PHP-DEV] [RFC] UString On Jul 1, 2015, at 1:06 PM, Sara Golemon poll...@php.net wrote: On Tue, Jun 30, 2015 at 10:36 PM, Joe Watkins pthre...@pthreads.org wrote: Another possible issue is engine integration: $string = (UString) $someString; $string = (UString) someString; That sounds as a cool idea to discuss as a completely separate, unrelated RFC, and not specific to UString. e.g. $obj = (ClassName)$arg; /* turns into */ $obj = new ClassName($arg); So you could use casting with any class which supports single-argument constructors. But again, orthogonal to this RFC. -Sara -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php Expanding on this idea, a separate RFC could propose a magic __cast($value) static method that would be called for code like below: $obj = (ClassName) $scalarOrObject; // Invokes ClassName::__cast($scalarOrObject); This would allow UString to implement casting a string to a UString and allow users to implement such behavior with their own classes. However, I would not implement such casting syntax for UString only. Being able to write $ustring = (UString) $string; without the ability to do so for other classes would be unusual and confusing in my opinion. If an RFC adding such behavior was implemented, UString could be updated to support casting. Obviously a UString should be able to be cast to a scalar string using (string) $ustring. If performance is a concern, UString::__toString() should cache the result so multiple casts to the same object are quick. One way doing this is already there thanks https://wiki.php.net/rfc/operator_overloading_gmp . Consider $n = gmp_init(42); var_dump($n, (int)$n); However the other way round - could be done on case by case basis, IMHO. 
While it could make sense between a class and a scalar, casting class to class is quite an unpredictable thing. While users could implement it, how would it be handled with arbitrary objects? How would it map properties, would those classes need to implement the same interface, et cetera? We're not in C at this point, where we would just force a block of memory to be interpreted as we want. Regards Anatol
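Anatol's GMP example can be expanded into a small script showing both directions. It assumes the gmp extension is loaded and skips silently otherwise. The object-to-scalar direction uses a cast, while the scalar-to-object direction is an ordinary function call.

```php
<?php
// Requires ext-gmp; the guard keeps the script harmless without it.
if (extension_loaded('gmp')) {
    $n = gmp_init(42);

    // Object -> scalar: the GMP class supports these casts.
    $asInt    = (int) $n;
    $asString = (string) $n;
    var_dump($asInt);    // int(42)
    var_dump($asString); // string(2) "42"

    // Scalar -> object: no cast syntax, just a constructor-like function.
    $back = gmp_init($asInt);
    var_dump(gmp_cmp($back, $n)); // int(0), meaning the values are equal
}
```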
Re: [PHP-DEV] [RFC] UString
On Jul 1, 2015, at 2:25 PM, Anatol Belski anatol@belski.net wrote: Expanding on this idea, a separate RFC could propose a magic __cast($value) static method that would be called for code like below: $obj = (ClassName) $scalarOrObject; // Invokes ClassName::__cast($scalarOrObject); This would allow UString to implement casting a string to a UString and allow users to implement such behavior with their own classes. However, I would not implement such casting syntax for UString only. Being able to write $ustring = (UString) $string; without the ability to do so for other classes would be unusual and confusing in my opinion. If an RFC adding such behavior was implemented, UString could be updated to support casting. Obviously a UString should be able to be cast to a scalar string using (string) $ustring. If performance is a concern, UString::__toString() should cache the result so multiple casts to the same object are quick. Hi, One way doing this is already there thanks https://wiki.php.net/rfc/operator_overloading_gmp . Consider $n = gmp_init(42); var_dump($n, (int)$n); However the other way round - could be done on case by case basis, IMHO. Where it could make sense for class vs scalar, casting class to class is a quite unpredictable thing. While users could implement it, how is it handled with arbitrary objects? How would it map properties, would those classes need to implement the same interface, et cetera? We're not in C at this point, where we would just force a block of memory to be interpreted as we want. Regards Anatol Hello, I was thinking that the __cast() static method would examine the parameter given, then use that value to build a new object and return it or return null (which would then result in the engine throwing an Error saying that $scalarOrValue could not be cast to ClassName). It was just a suggestion to see what others thought because someone suggested supporting casting syntax such as $ustring = (UString) $scalarString. 
I don’t really care for either method though (__cast() or enabling casting just for UString), as they don't offer any advantage over writing new UString($string) or UString::fromString($string). Aaron Piotrowski
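The `__cast()` behaviour Aaron outlines (return an object, or return null and let the engine throw an Error) can be emulated in userland. Everything below is hypothetical: PHP has no `__cast()` magic method, the `Text` class is invented for the example, and the `cast()` function stands in for the role the engine would play.

```php
<?php
// Hypothetical: __cast() is NOT a PHP magic method; here it is just a
// static method that the cast() helper below calls explicitly.
final class Text
{
    private $value;

    private function __construct(string $value)
    {
        $this->value = $value;
    }

    // Returns an instance, or null when the value cannot be converted.
    public static function __cast($value)
    {
        return is_string($value) ? new self($value) : null;
    }

    public function __toString(): string
    {
        return $this->value;
    }
}

// Stand-in for what the engine would do for (ClassName) $value.
function cast(string $class, $value)
{
    $result = $class::__cast($value);
    if ($result === null) {
        throw new Error(gettype($value) . " could not be cast to $class");
    }
    return $result;
}

echo cast(Text::class, 'abc'), "\n"; // abc

try {
    cast(Text::class, ['abc']);
} catch (Error $e) {
    echo $e->getMessage(), "\n"; // array could not be cast to Text
}
```

Written this way, the static method is just a named constructor with a different failure mode, which matches Aaron's conclusion that it offers little over `UString::fromString($string)`.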
Re: [PHP-DEV] [RFC] UString
Hi Joe.

On 01.07.15 at 07:36, Joe Watkins wrote:
> [..]
> Another possible issue is engine integration:
>
>     $string = (UString) $someString;
>     $string = (UString) "someString";
>
> These aren't very different to 'new UString', but for an integrated solution, kind of expected to work.

Why would that be expected behaviour? I mean, I can't do

    $date = (DateTime) $timestring;

after all, can I? But I can use

    $date = new DateTime($timestring);

Just my 2 cents.

Cheers
Andreas

--
Andreas Heigl | mailto:andr...@heigl.org | N 50°22'59.5 E 08°23'58 | http://andreas.heigl.org | http://hei.gl/wiFKy7 | http://hei.gl/root-ca
Re: [PHP-DEV] [RFC] UString
Morning,

> Why would that be expected behaviour? I mean I can't do
>
>     $date = (DateTime) $timestring;

No, but you also can't do:

    $string = (string) $datetime;

But you can do:

    $string = (string) $ustring;

where $ustring is an instance of UString.

Even if you never write $string = (string) $ustring, the engine will perform the same action all the time, whenever you pass a UString to anything expecting a string.

It feels like a complete implementation should support both casts.

Cheers
Joe

On Wed, Jul 1, 2015 at 7:38 AM, Andreas Heigl andr...@heigl.org wrote:
> Hi Joe.
>
> On 01.07.15 at 07:36, Joe Watkins wrote:
>> [..]
>> Another possible issue is engine integration:
>>
>>     $string = (UString) $someString;
>>     $string = (UString) "someString";
>>
>> These aren't very different to 'new UString', but for an integrated solution, kind of expected to work.
>
> Why would that be expected behaviour? I mean I can't do $date = (DateTime) $timestring; after all, can I? But I can use $date = new DateTime($timestring);
>
> Just my 2 cents.
>
> Cheers
> Andreas
>
> --
> Andreas Heigl | mailto:andr...@heigl.org | N 50°22'59.5 E 08°23'58 | http://andreas.heigl.org | http://hei.gl/wiFKy7 | http://hei.gl/root-ca
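Joe's point about implicit conversion can be seen with any class that defines `__toString()`. The `Label` class below is a hypothetical stand-in for UString; the sketch assumes the default (non-strict) typing mode, in which such an object is accepted wherever a string parameter is expected.

```php
<?php
// Hypothetical stand-in for an object that can become a string.
final class Label
{
    private $text;

    public function __construct(string $text)
    {
        $this->text = $text;
    }

    public function __toString(): string
    {
        return $this->text;
    }
}

$label = new Label('unicode');

// Explicit cast.
echo (string) $label, "\n";    // unicode

// Implicit conversion: string functions trigger __toString() themselves.
echo strtoupper($label), "\n"; // UNICODE
echo strlen($label), "\n";     // 7

// DateTime defines no __toString(), so it cannot be cast to string.
var_dump(method_exists('DateTime', '__toString')); // bool(false)
```

Each implicit conversion produces a fresh scalar string, which is the cost Joe refers to elsewhere in the thread when he calls repeated casting wasteful.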
Re: [PHP-DEV] [RFC] UString
Morning Sara, Curious what the current state of the UString RFC is. I've got a functionality request for HHVM to wrap icu::UnicodeString and was hoping to match PHP behavior if any plans had been made, and lo... here's a plan! I was (semi) convinced by Dmitry that the superior implementation is one for Zend, so I backed off ... I think (hope) that Joe's intention was to make it as an extension for proof of concept, but make it part of the intl extension when it comes to full adoption by the runtime. If not, let's talk about making that the intent, because intl is where this belongs. The folder the source code is in makes no nevermind, the real issue with integration is changing all of intl, and lots of other stuff, to accept UString, since casting to basic type , while acceptable for simple tests, would get extremely wasteful for an application of any complexity. Another possible issue is engine integration: $string = (UString) $someString; $string = (UString) someString; These aren't very different to 'new UString', but for an integrated solution, kind of expected to work. I don't know what the solutions are to these problems, I'm all ears ... Cheers Joe On Wed, Jul 1, 2015 at 12:30 AM, Sara Golemon poll...@php.net wrote: On Mon, Mar 2, 2015 at 12:48 AM, Nikita Popov nikita@gmail.com wrote: On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org wrote: https://wiki.php.net/rfc/ustring This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever. Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;) Curious what the current state of the UString RFC is. I've got a functionality request for HHVM to wrap icu::UnicodeString and was hoping to match PHP behavior if any plans had been made, and lo... here's a plan! I'm not totally convinced by this proposal. 
We already have quite a number of extensions that deal with unicode text in one way or another (at least intl, mbstring and iconv). This adds yet another way of dealing with this issue - a way that will have to be combined with at least two other extensions (mbstring or iconv for input handling and conversion) and intl for any non-trivial operations. There's nothing wrong with adding another approach for unicode handling per se, but I'd like to have more empahsis on how this integrates with existing functionality and why it is implemented separately from it (especially intl), etc. I think (hope) that Joe's intention was to make it as an extension for proof of concept, but make it part of the intl extension when it comes to full adoption by the runtime. If not, let's talk about making that the intent, because intl is where this belongs. For my bikeshedding part, I'd recommend against the u() function helper as it pollutes the global function namespace and takes a very fundamental name. intl\u() might be worth considering, but we'll need to have a discussion about namespacing for the intl extension as a whole (separate topic). I'd also recommend IntlString rather than UString as nearly all the Intl classes follow this convention. The one notable exception being UConverter (which yes, I added, and I regret the departure in naming). Otherwise, while there's room to quibble about specific API names and arguments, the general concept seems a no-brainer. -Sara
Re: [PHP-DEV] [RFC] UString
On Mon, Mar 2, 2015 at 12:48 AM, Nikita Popov nikita@gmail.com wrote: On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org wrote: https://wiki.php.net/rfc/ustring This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever. Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;) Curious what the current state of the UString RFC is. I've got a functionality request for HHVM to wrap icu::UnicodeString and was hoping to match PHP behavior if any plans had been made, and lo... here's a plan! I'm not totally convinced by this proposal. We already have quite a number of extensions that deal with unicode text in one way or another (at least intl, mbstring and iconv). This adds yet another way of dealing with this issue - a way that will have to be combined with at least two other extensions (mbstring or iconv for input handling and conversion) and intl for any non-trivial operations. There's nothing wrong with adding another approach for unicode handling per se, but I'd like to have more empahsis on how this integrates with existing functionality and why it is implemented separately from it (especially intl), etc. I think (hope) that Joe's intention was to make it as an extension for proof of concept, but make it part of the intl extension when it comes to full adoption by the runtime. If not, let's talk about making that the intent, because intl is where this belongs. For my bikeshedding part, I'd recommend against the u() function helper as it pollutes the global function namespace and takes a very fundamental name. intl\u() might be worth considering, but we'll need to have a discussion about namespacing for the intl extension as a whole (separate topic). I'd also recommend IntlString rather than UString as nearly all the Intl classes follow this convention. 
The one notable exception being UConverter (which yes, I added, and I regret the departure in naming). Otherwise, while there's room to quibble about specific API names and arguments, the general concept seems a no-brainer. -Sara
Re: [PHP-DEV] [RFC] UString
On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org wrote:
> Morning internalz,
>
> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever. Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;)

I'm not totally convinced by this proposal. We already have quite a number of extensions that deal with unicode text in one way or another (at least intl, mbstring and iconv). This adds yet another way of dealing with this issue - a way that will have to be combined with at least two other extensions (mbstring or iconv for input handling and conversion) and intl for any non-trivial operations.

There's nothing wrong with adding another approach for unicode handling per se, but I'd like to have more emphasis on how this integrates with existing functionality and why it is implemented separately from it (especially intl), etc.

On a more general note, I'd appreciate it if RFCs proposing the inclusion of extensions moved more of their content into the actual RFC, as opposed to being thin wrappers around the extension README/docs. We had this issue with the pecl_http RFC and the same applies here. I think the suggested API is a pretty important aspect of the proposal and as such should be included in the RFC and maybe also commented a bit ;)

Nikita
Re: [PHP-DEV] [RFC] UString
On Mon, Mar 2, 2015 at 12:48 AM, Nikita Popov nikita@gmail.com wrote: On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org wrote: Morning internalz, https://wiki.php.net/rfc/ustring This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever. Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;) I'm not totally convinced by this proposal. We already have quite a number of extensions that deal with unicode text in one way or another (at least intl, mbstring and iconv). This adds yet another way of dealing with this issue - a way that will have to be combined with at least two other extensions (mbstring or iconv for input handling and conversion) and intl for any non-trivial operations. There's nothing wrong with adding another approach for unicode handling per se, but I'd like to have more empahsis on how this integrates with existing functionality and why it is implemented separately from it (especially intl), etc. On a more general note, I'd appreciate it if RFCs proposing the inclusion of extensions moved more of their content into the actual RFC, as opposed to being thin wrappers around the extension README/docs. We had this issue with the pecl_http RFC and the same applies here. I think the suggested API is a pretty important aspect of the proposal and as such should be included in the RFC and maybe also commented a bit ;) Full ack. Both paragraph. As of now, and based on the previous discussions pointed out the same issues (minus the RFC one, but this is a detail, important, but a detail), I am also not convinced this is the way to tackle the Unicode text support. It should either be part of intl (and proposed to enable intl always for 7, with other RFC) or main. Main has the advantage to provide a easier integration with other extensions. 
Cheers, -- Pierre @pierrejoye | http://www.libgd.org
Re: [PHP-DEV] [RFC] UString
Hi Lester,

On Mon, Mar 2, 2015 at 5:34 AM, Lester Caine les...@lsces.co.uk wrote:
> On 28/02/15 06:48, Joe Watkins wrote:
>> This is just a quick note to announce my intention to ready this RFC for voting next week.
>
> Since there is nothing in this which needs any changes to the core then surely it simply needs to exist in pecl until such time as a proper replacement for unicode in core strings has been addressed? Since it will still require intl to provide those areas it does not support, and I question if we really need to provide yet another encoding converter. A unicode string handler that just handles UTF8 strings may be yet another stepping stone, but it still falls short of being able to handle all of the internationalization problems and is simply an alternate to mbstring so one either runs both, or sit down and convert all the third party libraries to eliminate mbstring.
>
> Like http extension, it's not essential that it's loaded by default, and leaving it in pecl allows development outside that of the core?

Although the current code does not seem to have operator support like GMP's, I'm sure we'll have this before release, i.e.

    $new = $some_ustring . 'abc'; // $new is UString object

To implement a feature like this, it cannot be PECL.

My only concern about this RFC is performance. It's loosely integrated into PHP core, which may affect efficiency. I suppose other people are working on a simpler and tighter integration into core.

Any comments on this?

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net
Re: [PHP-DEV] [RFC] UString
On 01/03/2015 21:26, Yasuo Ohgaki wrote:
> Although it seems current code does not have code like GMP. I'm sure we'll have this before release. i.e.
>
>     $new = $some_ustring . 'abc'; // $new is UString object
>
> To implement feature like this, it cannot be PECL.

Why not? I would have thought any extension can hook into the operator overloading API that GMP uses, just as they can hook into other object behaviours. Is there some difference between how bundled and PECL extensions are loaded that would prevent this?

Regards,

--
Rowan Collins
[IMSoP]
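For reference, this is what the GMP operator overloading looks like from userland. The sketch assumes the gmp extension is loaded and does nothing otherwise. It only demonstrates GMP's arithmetic operators; whether UString would hook the `.` operator in the same way is exactly the open question in this exchange, and nothing here answers it.

```php
<?php
// Requires ext-gmp; the guard keeps the script harmless without it.
if (extension_loaded('gmp')) {
    // An arithmetic operator applied to a GMP object yields another GMP object.
    $sum = gmp_init(40) + 2;
    var_dump($sum instanceof GMP); // bool(true)
    var_dump(gmp_strval($sum));    // string(2) "42"

    $product = $sum * gmp_init(2);
    var_dump((int) $product);      // int(84)
}
```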
Re: [PHP-DEV] [RFC] UString
Hi Rowan,

On Mon, Mar 2, 2015 at 6:32 AM, Rowan Collins rowan.coll...@gmail.com wrote:
> On 01/03/2015 21:26, Yasuo Ohgaki wrote:
>> Although it seems current code does not have code like GMP. I'm sure we'll have this before release. i.e.
>>
>>     $new = $some_ustring . 'abc'; // $new is UString object
>>
>> To implement feature like this, it cannot be PECL.
>
> Why not? I would have thought any extension can hook into the operator overloading API that GMP uses, just as they can hook into other object behaviours. Is there some difference between how bundled and PECL extensions are loaded that would prevent this?

OK. I missed that the GMP improvement includes generic operator overloading. If the current implementation is good enough for UString, it could be PECL. Otherwise, we could add the missing parts to core so that UString can live in PECL.

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net
Re: [PHP-DEV] [RFC] UString
Hi Joe and Rowan, On Mon, Mar 2, 2015 at 6:37 AM, Rowan Collins rowan.coll...@gmail.com wrote: On 01/03/2015 20:34, Lester Caine wrote: On 28/02/15 06:48, Joe Watkins wrote: This is just a quick note to announce my intention to ready this RFC for voting next week. Since there is nothing in this which needs any changes to the core then surly it simply needs to exist in pecl until such time as a proper replacement for unicode in core strings has been addressed? Since it will still require intl to provide those areas it does not support, and I question if we really need to provide yet another encoding converter. A unicode string handler that just handles UTF8 strings may be yet another stepping stone, but it still falls short of beings able to handle all of the internationalization problems and is simply an alternate to mbstring so one either runs both, or sit down and convert all the third party libraries to eliminate mbstring. Like http extension, it's not essential that it's loaded by default, and leaving it in pecl allows development outside that of the core? I think this is probably a good idea at this stage. It will give people a chance to play around with it in an experimental state before committing to maintaining a particular API. Since there's no real BC break here, there's no reason it couldn't be bundled into 7.1 if it was deemed ready by then, so it seems unwise to rush into including it in 7.0 straight from what feels like a prototype implementation. Sounds reasonable. Joe, I don't have much time to help, but I'm willing to help UString development. I think it's better to keep it simple. Having unified internal encoding (NFC normalized UTF-8 string without BOM) for internal string representation would be much simpler than multiple encodings. We may consider various issues/ideas like this in relatively long term. 
http://websec.github.io/unicode-security-guide/character-transformations/ http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html Regards, -- Yasuo Ohgaki yohg...@ohgaki.net
Re: [PHP-DEV] [RFC] UString
Hi Florian, On Mon, Mar 2, 2015 at 5:57 AM, Florian Margaine flor...@margaine.com wrote: Le 1 mars 2015 21:26, Derick Rethans der...@php.net a écrit : Hey Joe, I think there are a few issues with the proposal, although I like the general idea. I've had the tab with the RFC open since October... but never looked at it until now :-/. So, a few comments: - UString as a name. I think I am going to prefer Text as a class name. Unicode (and intl/icu) have lots of operators acting on items containing unicode strings. But they are really pieces of text. For example sentences, word break iterators, etc. UString *feels* clunky, and not standard. If it's going to be part of PHP core, then we should pick a core name. (I might prefer String, but that's going to cause a whole lot of issues obviously). Isn't this solved if we use \php\String? I suppose we need Context Sensitive Lexer for String, but I guess it passes. Let's use namespace for new internal classes at least. Regards, -- Yasuo Ohgaki yohg...@ohgaki.net
Re: [PHP-DEV] [RFC] UString
On 01/03/2015 20:59, Yasuo Ohgaki wrote: However, I don't mind too much allowing any encoding stored in Text/UString object. IIRC, Ruby does this and does not have much of a problem. As I understand it, Ruby's string type is actually a whole bunch of overloaded types, each responsible for re-implementing the various methods available. This leads to a whole bunch of partially supported encodings/codepages, which is a big pile of leaky abstraction for the small benefit of removing re-encoding operations in a few scenarios. Unicode is explicitly designed to supersede all previous encodings, so it makes perfect sense to me to use it to internally represent what the user just wants to think of as text. The fact that within that internal representation you need some byte-level encoding then leads to the optimisation of using a byte-level encoding the user is likely to use as input and output, i.e. UTF-8. Regards, -- Rowan Collins [IMSoP]
Re: [PHP-DEV] [RFC] UString
Hi Joe, On Sat, Feb 28, 2015 at 3:48 PM, Joe Watkins pthre...@pthreads.org wrote: This is just a quick note to announce my intention to ready this RFC for voting next week. I know I'm a little late maybe, I was real sick most of last week, so couldn't do anything useful. A couple of us intend to fix outstanding issues on github and those raised here, tidy the RFC and open the vote for 7. I would ask anyone interested to scan through this thread and announce concerns that are not mentioned asap. I appreciate your proposal! Rowan pointed out some important things. I don't understand details as I don't read your code yet. I'll try to read and comment in a few days. Regards, -- Yasuo Ohgaki yohg...@ohgaki.net
Re: [PHP-DEV] [RFC] UString
Hi Joe, On Sun, Mar 1, 2015 at 7:14 PM, Yasuo Ohgaki yohg...@ohgaki.net wrote: public function __construct([string $string [, string $source_codepage [, string $substitute_char]]]); One additional comment on the constructor. It should have a default normalization. I think it should be NFC, as most systems use it. (OSX uses NFD for filenames! I hate it, and most Japanese developers hate it.) The API may be public function __construct([string $string [, string $source_codepage [, string $substitute_char [, $normalization]]]]); If $substitute_char is NULL, disallow invalid encoding. Regards, -- Yasuo Ohgaki yohg...@ohgaki.net
Re: [PHP-DEV] [RFC] UString
Hi Joe, On Sun, Mar 1, 2015 at 6:14 PM, Yasuo Ohgaki yohg...@ohgaki.net wrote: On Sat, Feb 28, 2015 at 3:48 PM, Joe Watkins pthre...@pthreads.org wrote: This is just a quick note to announce my intention to ready this RFC for voting next week. I know I'm a little late maybe, I was real sick most of last week, so couldn't do anything useful. A couple of us intend to fix outstanding issues on github and those raised here, tidy the RFC and open the vote for 7. I would ask anyone interested to scan through this thread and announce concerns that are not mentioned asap. I appreciate your proposal! Rowan pointed out some important things. I don't understand details as I don't read your code yet. I'll try to read and comment in a few days. I guess you would like to start voting today or tomorrow, so I briefly read your code. I think your approach is good. I like UString be UTF-8 always by default regardless of other settings. i.e. default_charset, internal_encoding. I see few missing key APIs that would be critical for multibyte char handling, like string length, string width, normalization, string conversions like Zenkaku to Hankaku, encoding(codepage) converter. However, all of these may be added later as they are already implemented in ICU. I think UString may be better to use UTF-8 always to make users life a little simpler. Your constructor only have codepage setting that is used as UString codepage to support other codepage(encodings). Rather than to have various encoding support, I think constructor needs encoding(codepage) conversion feature. Codepage parameter is better to be used as from encoding(codepage) parameter and convert any encoding(codepage) to UTF-8. If conversion fails, it should raise exception. It's better to have forgiving API for malformed strings if user explicitly specified to do so. 
Constructor may be public function __construct([string $string [, string $source_codepage [, string $substitute_char]]]); $source_codepage is the source string encoding (codepage), and $string is always converted to UTF-8. If $substitute_char is omitted, raise an exception for an invalid $string. If $substitute_char is specified (it can be '', the empty string), convert $string according to $source_codepage and just remove/replace invalid byte sequences in $string. With this constructor, the string stored in a UString object is always valid UTF-8. Any character encoding (including UTF-16/32 and the 200 encoding names supported by ICU) may be used for the source string. Since there will be no variable codepage setting for a UString object, the following may be removed: public static function getDefaultCodepage(); public static function setDefaultCodepage(string $codepage); ICU uses "codepage" to mean character encoding, but it may be better to say "character encoding", as people are not used to ICU terminology. This is what I thought. I didn't read your code carefully, so I might be wrong. Please correct me if I'm mistaken. I suppose there are other people working on simpler Unicode string libraries. I would like to hear opinions from them. BTW, we really need byte_len(). strlen() is just a confusing API... It's not in the scope of this RFC, though. Regards, -- Yasuo Ohgaki yohg...@ohgaki.net
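The byte_len() remark can be illustrated with a small sketch. byte_len() is only the name floated in the message above, not an existing PHP function; the userland stand-in below is an assumption for illustration, and mb_strlen() requires the mbstring extension.

```php
<?php
// byte_len() is the name floated in the thread; it is NOT a real PHP function.
// This userland stand-in simply makes the unit (bytes) explicit.
function byte_len(string $s): int
{
    return strlen($s); // strlen() always counts bytes, never characters
}

$text = "日本語"; // three code points, three bytes each in UTF-8

echo byte_len($text), "\n";           // 9 (bytes)
echo mb_strlen($text, 'UTF-8'), "\n"; // 3 (code points)
```

The two numbers differ for any non-ASCII text, which is why a name that states the unit is arguably less confusing than strlen().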
Re: [PHP-DEV] [RFC] UString
On 01/03/2015 20:34, Lester Caine wrote: On 28/02/15 06:48, Joe Watkins wrote: This is just a quick note to announce my intention to ready this RFC for voting next week. Since there is nothing in this which needs any changes to the core then surely it simply needs to exist in pecl until such time as a proper replacement for unicode in core strings has been addressed? Since it will still require intl to provide those areas it does not support, and I question if we really need to provide yet another encoding converter. A unicode string handler that just handles UTF8 strings may be yet another stepping stone, but it still falls short of being able to handle all of the internationalization problems and is simply an alternate to mbstring so one either runs both, or sit down and convert all the third party libraries to eliminate mbstring. Like http extension, it's not essential that it's loaded by default, and leaving it in pecl allows development outside that of the core? I think this is probably a good idea at this stage. It will give people a chance to play around with it in an experimental state before committing to maintaining a particular API. Since there's no real BC break here, there's no reason it couldn't be bundled into 7.1 if it was deemed ready by then, so it seems unwise to rush into including it in 7.0 straight from what feels like a prototype implementation. Regards, -- Rowan Collins [IMSoP]
Re: [PHP-DEV] [RFC] UString
Hi Joe and Rowan, On Mon, Mar 2, 2015 at 7:14 AM, Yasuo Ohgaki yohg...@ohgaki.net wrote: Hi Joe and Rowan, On Mon, Mar 2, 2015 at 6:37 AM, Rowan Collins rowan.coll...@gmail.com wrote: On 01/03/2015 20:34, Lester Caine wrote: On 28/02/15 06:48, Joe Watkins wrote: This is just a quick note to announce my intention to ready this RFC for voting next week. Since there is nothing in this which needs any changes to the core then surly it simply needs to exist in pecl until such time as a proper replacement for unicode in core strings has been addressed? Since it will still require intl to provide those areas it does not support, and I question if we really need to provide yet another encoding converter. A unicode string handler that just handles UTF8 strings may be yet another stepping stone, but it still falls short of beings able to handle all of the internationalization problems and is simply an alternate to mbstring so one either runs both, or sit down and convert all the third party libraries to eliminate mbstring. Like http extension, it's not essential that it's loaded by default, and leaving it in pecl allows development outside that of the core? I think this is probably a good idea at this stage. It will give people a chance to play around with it in an experimental state before committing to maintaining a particular API. Since there's no real BC break here, there's no reason it couldn't be bundled into 7.1 if it was deemed ready by then, so it seems unwise to rush into including it in 7.0 straight from what feels like a prototype implementation. Sounds reasonable. Joe, I don't have much time to help, but I'm willing to help UString development. I think it's better to keep it simple. Having unified internal encoding (NFC normalized UTF-8 string without BOM) for internal string representation would be much simpler than multiple encodings. We may consider various issues/ideas like this in relatively long term. 
http://websec.github.io/unicode-security-guide/character-transformations/ http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html We used to have EXPERIMENTAL module. How about have this as EXPERIMENTAL module in source distribution? It gets more attentions and development will be faster. Regards, -- Yasuo Ohgaki yohg...@ohgaki.net
Re: [PHP-DEV] [RFC] UString
Hey Joe, I think there are a few issues with the proposal, although I like the general idea. I've had the tab with the RFC open since October... but never looked at it until now :-/. So, a few comments: - UString as a name. I think I am going to prefer Text as a class name. Unicode (and intl/icu) have lots of operators acting on items containing unicode strings. But they are really pieces of text. For example sentences, word break iterators, etc. UString *feels* clunky, and not standard. If it's going to be part of PHP core, then we should pick a core name. (I might prefer String, but that's going to cause a whole lot of issues obviously). - Needs More Methods I had a look at the API that that links to, and I miss operators like iterators. Over words, sentences, characters, etc. Basically the functionality of http://docs.php.net/manual/en/class.intlbreakiterator.php, http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php I realize intl already immplements, this, but it's really beneficial to have for a Text class - especially for replacing functionality where people now look over a string - with a character index. - Not a full String API Replacement I would certainly expect more from it than just the UnicodeString API. Perhaps not for a first iteration, but certainly for subsequent versions. Things like transliterations, and specifically iterators would be high on my list. - Patch toUpper/toLower, there is a missing one for toTitle - In the code's README: Note: UString is interchangable with zend strings for method parameters and can be cast for output/conversion to zend strings How does that work? And what would it convert to? - How are characters counted? Is a character a Code Point, or is a character a base character + combining diacritics. In the first form, A + ° is considered as characters, in the second option, just one. 
For wordwrap, splice, substring, it is really important that only the *full sequence* is considered as a character. And hence, a character really should be the full sequence. The text in charAt seems to contradict that, and that is a mistake. In the original PHP 6 we didn't do that due to perormance reasons, but that point is moot now as only people who opt into using Text will suffer from this. - trim What is a leading or trailing space? Is it just U+0020, or other Unicode defined space characters as well? (nbsp;, U+00A0 comes to mind here) - What is UG(defaultpad), about? - For the code: - there is some interesting, non standard whitespaceing going on: - { goes on next line after a func decl - sometimes 4 spaces in stead of a tab are used for indentation, - Why is there no __toString() ? - How can other extensions, not really making use of Text, use there strings (as UTF8 strings f.e.) cheers, Derick On Sat, 28 Feb 2015, Joe Watkins wrote: Morning internals, This is just a quick note to announce my intention to ready this RFC for voting next week. I know I'm a little late maybe, I was real sick most of last week, so couldn't do anything useful. A couple of us intend to fix outstanding issues on github and those raised here, tidy the RFC and open the vote for 7. I would ask anyone interested to scan through this thread and announce concerns that are not mentioned asap. Cheers Joe On Fri, Oct 24, 2014 at 3:01 PM, Chris Wright daveran...@php.net wrote: On 24 October 2014 07:03, Joe Watkins pthre...@pthreads.org wrote: On Thu, 2014-10-23 at 12:54 -0700, Stas Malyshev wrote: Hi! P.S. u() is a bad name, will break lots of code, i.e. Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe. /me cringes ... I wonder how much of a problem it really is, usually when we say some function name is a problem is because of hundreds and hundreds of results on github. 
If it's a huge problem then we should rename it, if we have to dig around for a single project that's incompatible, or even a handful, then it's not really a problem. Cheers Joe I can see this being something relatively common. While I personally would never do it, there are a few reasons I can think of that people *might* do it: - Wrapper for creating u HTML output - urlencode() shortcut - (obviously) various unicode-related things Searching on codesearch [1] revealed (amongst a few other hits on the first page) another interesting use of it in the hhvm test suite [2]. It's difficult to search for this because all the available public search engines that I know of do fuzzy matching. Sorry. This sucks, because every other option we have for this is sucks. On the bright side, anything chosen could always be aliased at the top of the file: use function __u as u; This also sucks, but it sucks a little bit less because the
Re: [PHP-DEV] [RFC] UString
On Sun, 1 Mar 2015, Yasuo Ohgaki wrote: Hi Joe, On Sun, Mar 1, 2015 at 7:14 PM, Yasuo Ohgaki yohg...@ohgaki.net wrote: public function __construct([string $string [, string $source_codepage [, string $substitute_char]]]); One additional comment for constructor. It should have default normalization. I think it should be NFC as most system uses it. (OSX uses NFD for filenames! I hate it and most of Japanese developers hate it) The API may be public function __construct([string $string [, string $source_codepage [, string $substitute_char [, $normalization]]]]); I wouldn't leave normalization as an option, and it certainly should not be done by default. I would suggest other (mutable) methods to convert between normalisation forms. If $substitute_char is NULL, disallow invalid encoding. I don't think substitutions (i.e., data loss) should be allowed at all. This should throw an immediate exception. If you really want this, I suggest adding a factory method for it, e.g. Text::createWithSubstitutions, or whatever better name. cheers, Derick
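The split suggested above (a strict constructor that throws, plus a separately named lossy factory) might look roughly like the following userland sketch. The class name TextSketch and everything inside it are assumptions for illustration only; this is not the RFC's implementation, and it relies on the mbstring extension rather than ICU.

```php
<?php
// Hypothetical class: neither the name nor the internals come from the RFC.
final class TextSketch
{
    /** @var string Always valid UTF-8. */
    private $utf8;

    // Strict path: a byte sequence that is invalid in the source encoding throws.
    public function __construct(string $bytes, string $sourceEncoding = 'UTF-8')
    {
        if (!mb_check_encoding($bytes, $sourceEncoding)) {
            throw new InvalidArgumentException("Input is not valid $sourceEncoding");
        }
        $this->utf8 = mb_convert_encoding($bytes, 'UTF-8', $sourceEncoding);
    }

    // Lossy path, opt-in by name: invalid sequences are replaced by mbstring's
    // substitute character instead of raising an error.
    public static function createWithSubstitutions(string $bytes, string $sourceEncoding = 'UTF-8'): self
    {
        $clean = mb_convert_encoding($bytes, 'UTF-8', $sourceEncoding);
        return new self($clean, 'UTF-8');
    }

    public function __toString(): string
    {
        return $this->utf8;
    }
}

echo new TextSketch("caf\xE9", 'ISO-8859-1'), "\n"; // stored and printed as UTF-8
```

Keeping the lossy behaviour behind its own method name means data loss can only happen where the caller asked for it explicitly.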
Re: [PHP-DEV] [RFC] UString
On 28/02/15 06:48, Joe Watkins wrote: This is just a quick note to announce my intention to ready this RFC for voting next week. Since there is nothing in this which needs any changes to the core then surely it simply needs to exist in pecl until such time as a proper replacement for unicode in core strings has been addressed? Since it will still require intl to provide those areas it does not support, and I question if we really need to provide yet another encoding converter. A unicode string handler that just handles UTF8 strings may be yet another stepping stone, but it still falls short of being able to handle all of the internationalization problems and is simply an alternate to mbstring so one either runs both, or sits down and converts all the third party libraries to eliminate mbstring. Like http extension, it's not essential that it's loaded by default, and leaving it in pecl allows development outside that of the core? -- Lester Caine - G8HFL - Contact - http://lsces.co.uk/wiki/?page=contact L.S.Caine Electronic Services - http://lsces.co.uk EnquirySolve - http://enquirysolve.com/ Model Engineers Digital Workshop - http://medw.co.uk Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Re: [PHP-DEV] [RFC] UString
On Sat, 28 Feb 2015, Rowan Collins wrote: On 28/02/2015 06:48, Joe Watkins wrote: Morning internals, This is just a quick note to announce my intention to ready this RFC for voting next week. I know I'm a little late maybe, I was real sick most of last week, so couldn't do anything useful. A couple of us intend to fix outstanding issues on github and those raised here, tidy the RFC and open the vote for 7. I would ask anyone interested to scan through this thread and announce concerns that are not mentioned asap. I still think this class is trying to do several jobs, and not doing any of them very well, and I fear that people will see this class and expect it to solve problems which it actually ignores. Here are some concrete use cases I would like a simple interface to solve for me: - Take text from an ISO 88592-2 data source, pass it through generic text filters, and pass it to a UTF-16 data target. - Given a long string of Unicode text, give me a valid UTF-8 string which fits into a buffer with fixed byte size; i.e. give me the largest number of whole code points which fit into that number of bytes once encoded. - As above, but without stripping diacritics off the last character of the resulting string, i.e. give me the largest number of whole graphemes which fit. - Split a string into equal sized chunks of readable characters (graphemes), regardless of how many bytes or code points each chunk contains. UString currently falls short of all of these: - I can specify my input encoding (in the constructor or helper method, over-riding a static default, which is equivalent to ext/mbstring's global setting), but not my output encoding (there is no method to ask for a byte representation other than a string cast, which by definition has no parameters). Yeah, there should be an output method to convert to a target encoding. - I can ask for a fixed number of code points, but don't know how many bytes these will take until I cast to a UTF-8 string. 
As I said before, indexes into strings should not be based on code points, as the following would then break up the characters: $s = new Text("Ås"); echo $s->substring(1); When the Å is stored decomposed (U+0041 + U+030A), the output would start with a dangling combining ring (U+030A). Whereas when it is stored precomposed (U+00C5): $s = new Text("Ås"); echo $s->substring(1); would output s. The first result is not what people would expect. - I can't manipulate anything at the grapheme level at all, even though this is the most meaningful level of operation in most cases. Yes, graphemes should be the base blocks, not code points. Things it does do: - a handful of methods give meaningful international text support: toUpper(), toLower(), trim() - some methods could be done on byte strings if I ensure they're all in UTF-8: replace(), contains(), startsWith(), endsWith(), repeat() That doesn't always work when you have graphemes, or text in different normalisation forms. I.e., it should consider Å (U+00C5) and Å (U+0041 + U+030A) the same for contains and startsWith; that is, handle normalisation for comparison. - there may be limited situations where I want to dive into the code points which make up a string, although I can't think of many: $length, pad(), indexOf(), lastIndexOf(), charAt(), replaceSlice() Break iterators on either code points, or graphemes, might work here? - remaining methods avoid me creating invalid UTF-8, but don't help me much with real-life text: chunk(), split(), substring() - I can ask what codepage my Unicode string is in; I don't even understand what this means I think an efficient OO wrapper around ICU is a great idea, but more thought needs to go into what methods are exposed, and how people are going to use them in real code. Yes, I agree. I think this current proposal is a good start, but it needs to be worked out a little bit more before I think we should vote on it, however much I would like to see something like this in PHP. cheers, Derick -- http://derickrethans.nl | http://xdebug.org Like Xdebug? Consider a donation: http://xdebug.org/donate.php twitter: @derickr and @xdebug Posted with an email client that doesn't mangle email: alpine
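The difference between code-point and grapheme indexing discussed above can be reproduced in plain PHP without UString. The sketch below uses PCRE's \X (extended grapheme cluster) and the /u modifier; the helper names codePoints() and graphemes() are made up for this illustration.

```php
<?php
// Split a UTF-8 string into code points ('.' under the /u and /s modifiers).
function codePoints(string $utf8): array
{
    preg_match_all('/./us', $utf8, $m);
    return $m[0];
}

// Split a UTF-8 string into extended grapheme clusters (PCRE's \X).
function graphemes(string $utf8): array
{
    preg_match_all('/\X/u', $utf8, $m);
    return $m[0];
}

$decomposed = "A\u{030A}s"; // A + COMBINING RING ABOVE + s

echo count(codePoints($decomposed)), "\n"; // 3
echo count(graphemes($decomposed)), "\n";  // 2

// Dropping the first unit: by code point the ring is orphaned, by grapheme it is not.
$byCodePoint = implode('', array_slice(codePoints($decomposed), 1)); // "\u{030A}s"
$byGrapheme  = implode('', array_slice(graphemes($decomposed), 1));  // "s"
```

With the precomposed form "\u{00C5}s" both helpers return two units, so the two indexing schemes only disagree once combining marks are involved.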
Re: [PHP-DEV] [RFC] UString
Hi, Le 1 mars 2015 21:26, Derick Rethans der...@php.net a écrit : Hey Joe, I think there are a few issues with the proposal, although I like the general idea. I've had the tab with the RFC open since October... but never looked at it until now :-/. So, a few comments: - UString as a name. I think I am going to prefer Text as a class name. Unicode (and intl/icu) have lots of operators acting on items containing unicode strings. But they are really pieces of text. For example sentences, word break iterators, etc. UString *feels* clunky, and not standard. If it's going to be part of PHP core, then we should pick a core name. (I might prefer String, but that's going to cause a whole lot of issues obviously). Isn't this solved if we use \php\String? - Needs More Methods I had a look at the API that that links to, and I miss operators like iterators. Over words, sentences, characters, etc. Basically the functionality of http://docs.php.net/manual/en/class.intlbreakiterator.php, http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php I realize intl already immplements, this, but it's really beneficial to have for a Text class - especially for replacing functionality where people now look over a string - with a character index. - Not a full String API Replacement I would certainly expect more from it than just the UnicodeString API. Perhaps not for a first iteration, but certainly for subsequent versions. Things like transliterations, and specifically iterators would be high on my list. - Patch toUpper/toLower, there is a missing one for toTitle - In the code's README: Note: UString is interchangable with zend strings for method parameters and can be cast for output/conversion to zend strings How does that work? And what would it convert to? - How are characters counted? Is a character a Code Point, or is a character a base character + combining diacritics. 
In the first form, A + ° is considered as characters, in the second option, just one. For wordwrap, splice, substring, it is really important that only the *full sequence* is considered as a character. And hence, a character really should be the full sequence. The text in charAt seems to contradict that, and that is a mistake. In the original PHP 6 we didn't do that due to perormance reasons, but that point is moot now as only people who opt into using Text will suffer from this. - trim What is a leading or trailing space? Is it just U+0020, or other Unicode defined space characters as well? (nbsp;, U+00A0 comes to mind here) - What is UG(defaultpad), about? - For the code: - there is some interesting, non standard whitespaceing going on: - { goes on next line after a func decl - sometimes 4 spaces in stead of a tab are used for indentation, - Why is there no __toString() ? - How can other extensions, not really making use of Text, use there strings (as UTF8 strings f.e.) cheers, Derick On Sat, 28 Feb 2015, Joe Watkins wrote: Morning internals, This is just a quick note to announce my intention to ready this RFC for voting next week. I know I'm a little late maybe, I was real sick most of last week, so couldn't do anything useful. A couple of us intend to fix outstanding issues on github and those raised here, tidy the RFC and open the vote for 7. I would ask anyone interested to scan through this thread and announce concerns that are not mentioned asap. Cheers Joe On Fri, Oct 24, 2014 at 3:01 PM, Chris Wright daveran...@php.net wrote: On 24 October 2014 07:03, Joe Watkins pthre...@pthreads.org wrote: On Thu, 2014-10-23 at 12:54 -0700, Stas Malyshev wrote: Hi! P.S. u() is a bad name, will break lots of code, i.e. Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe. /me cringes ... I wonder how much of a problem it really is, usually when we say some function name is a problem is because of hundreds and hundreds of results on github. 
If it's a huge problem then we should rename it, if we have to dig around for a single project that's incompatible, or even a handful, then it's not really a problem. Cheers Joe I can see this being something relatively common. While I personally would never do it, there are a few reasons I can think of that people *might* do it: - Wrapper for creating u HTML output - urlencode() shortcut - (obviously) various unicode-related things Searching on codesearch [1] revealed (amongst a few other hits on the first page) another interesting use of it in the hhvm test suite [2]. It's difficult to search for this because all the available public search engines that I know of do fuzzy matching. Sorry. This sucks, because every other option we have for this is
Re: [PHP-DEV] [RFC] UString
Hi Joe and Derick, On Mon, Mar 2, 2015 at 5:25 AM, Derick Rethans der...@php.net wrote: I think there are a few issues with the proposal, although I like the general idea. I've had the tab with the RFC open since October... but never looked at it until now :-/. So, a few comments: - UString as a name. I think I am going to prefer Text as a class name. Unicode (and intl/icu) have lots of operators acting on items containing unicode strings. But they are really pieces of text. For example sentences, word break iterators, etc. UString *feels* clunky, and not standard. If it's going to be part of PHP core, then we should pick a core name. (I might prefer String, but that's going to cause a whole lot of issues obviously). I think it's better to have string/text data as certain encoding/codepage. Although Unicode encoding conversion is cheap, (I mean cheap compare to conversion to other encodings, like SJIS, EUC, ISO-2022, etc), UTF-8 is better because - PCRE only supports UTF-8 - SQLite only supports UTF-8 - PHP uses UTF-8 as the default now - Recent web apps uses UTF-8 as encoding - Single encoding for stored text/string is simpler - Considering normalization, having UTF-8 with NFC is less confusing. However, I don't mind too much allowing any encoding stored in Text/ UString object. IIRC, Ruby does this and have not much problem. If we have multiple encoding support. We should resolve $new = $str_utf8 . $str_sjis; // $new is UTF-8 or SJIS? Raise error? $new = $str_nfc . $str_nfd; // $new is NFC or NFD, mixed? Raise error? $new = $str_utf16le . $str_utf16be; // $new is ?? How BOM is handled? - Needs More Methods I had a look at the API that that links to, and I miss operators like iterators. Over words, sentences, characters, etc. 
Basically the functionality of http://docs.php.net/manual/en/class.intlbreakiterator.php, http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php I realize intl already immplements, this, but it's really beneficial to have for a Text class - especially for replacing functionality where people now look over a string - with a character index. There are missing features... We may implement most of them before release. - Not a full String API Replacement I would certainly expect more from it than just the UnicodeString API. Perhaps not for a first iteration, but certainly for subsequent versions. Things like transliterations, and specifically iterators would be high on my list. Sounds good. - Patch toUpper/toLower, there is a missing one for toTitle - In the code's README: Note: UString is interchangable with zend strings for method parameters and can be cast for output/conversion to zend strings How does that work? And what would it convert to? I guess Joe means it's using zend_string internally? - How are characters counted? Is a character a Code Point, or is a character a base character + combining diacritics. In the first form, A + ° is considered as characters, in the second option, just one. For wordwrap, splice, substring, it is really important that only the *full sequence* is considered as a character. And hence, a character really should be the full sequence. The text in charAt seems to contradict that, and that is a mistake. One reason I prefer NFC. In the original PHP 6 we didn't do that due to perormance reasons, but that point is moot now as only people who opt into using Text will suffer from this. - trim What is a leading or trailing space? Is it just U+0020, or other Unicode defined space characters as well? (nbsp;, U+00A0 comes to mind here) Any space is better to be trimmed. - What is UG(defaultpad), about? 
- For the code:
  - there is some interesting, non-standard whitespacing going on:
    - { goes on next line after a func decl
    - sometimes 4 spaces instead of a tab are used for indentation,
  - Why is there no __toString() ?

If this is missing, there should be __toString()

- How can other extensions, not really making use of Text, use their strings (as UTF8 strings f.e.)

I agree that the internal API needs improvement.

Overall, I think it's a good start if the basic issue is resolved. The most important question is whether it supports a single encoding or multiple encodings for stored text/string. There are many things programmers should know if multiple encodings are supported, but I don't object strongly to having multiple encoding support. It's nice to have the ability to handle SJIS, ISO-2022, etc. natively.

Regards,
--
Yasuo Ohgaki yohg...@ohgaki.net
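
The trim and normalization questions raised above can be made concrete with plain PHP strings. What follows is only a sketch: unicode_trim() is a hypothetical helper name, not part of the UString API, and it assumes valid UTF-8 input and PHP 7.0+ (for the \u{...} escape). It relies on PCRE's \p{Z} property class, which covers U+00A0.

```php
<?php
// Hypothetical helper, NOT part of the UString API discussed in this thread.
// Strips ASCII whitespace plus every code point in Unicode category Z
// (separators), which includes U+00A0 NO-BREAK SPACE.
function unicode_trim(string $utf8): string
{
    // preg_replace() returns null on invalid UTF-8; fall back to the input then.
    return preg_replace('/^[\s\p{Z}]+|[\s\p{Z}]+$/u', '', $utf8) ?? $utf8;
}

$nbsp   = "\u{00A0}";
$padded = $nbsp . " caf\u{00E9} " . $nbsp;

// Byte-oriented trim() only strips a fixed set of single bytes, so the
// two-byte UTF-8 sequence for U+00A0 survives at both ends.
var_dump(trim($padded) === $padded);               // bool(true)
var_dump(unicode_trim($padded) === "caf\u{00E9}"); // bool(true)

// The same visible text in two normalization forms is two different byte strings,
// which is why concatenating or comparing NFC and NFD data needs a defined rule.
$nfc = "\u{00E9}";   // precomposed e-acute
$nfd = "e\u{0301}";  // e followed by COMBINING ACUTE ACCENT
var_dump($nfc === $nfd);              // bool(false)
var_dump(strlen($nfc), strlen($nfd)); // int(2) int(3)
```

Whether a Text/UString trim() should behave like unicode_trim() here is exactly the open question in the review; the sketch only shows that the two definitions of "space" give different results.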
Re: [PHP-DEV] [RFC] UString
On 28/02/2015 06:48, Joe Watkins wrote:

Morning internals, This is just a quick note to announce my intention to ready this RFC for voting next week. I know I'm a little late maybe, I was real sick most of last week, so couldn't do anything useful. A couple of us intend to fix outstanding issues on github and those raised here, tidy the RFC and open the vote for 7. I would ask anyone interested to scan through this thread and announce concerns that are not mentioned asap.

I still think this class is trying to do several jobs, and not doing any of them very well, and I fear that people will see this class and expect it to solve problems which it actually ignores. Here are some concrete use cases I would like a simple interface to solve for me:

- Take text from an ISO 8859-2 data source, pass it through generic text filters, and pass it to a UTF-16 data target.
- Given a long string of Unicode text, give me a valid UTF-8 string which fits into a buffer with fixed byte size; i.e. give me the largest number of whole code points which fit into that number of bytes once encoded.
- As above, but without stripping diacritics off the last character of the resulting string, i.e. give me the largest number of whole graphemes which fit.
- Split a string into equal sized chunks of readable characters (graphemes), regardless of how many bytes or code points each chunk contains.

UString currently falls short of all of these:

- I can specify my input encoding (in the constructor or helper method, overriding a static default, which is equivalent to ext/mbstring's global setting), but not my output encoding (there is no method to ask for a byte representation other than a string cast, which by definition has no parameters).
- I can ask for a fixed number of code points, but don't know how many bytes these will take until I cast to a UTF-8 string.
- I can't manipulate anything at the grapheme level at all, even though this is the most meaningful level of operation in most cases.
Things it does do: - a handful of methods give meaningful international text support: toUpper(), toLower(), trim() - some methods could be done on byte strings if I ensure they're all in UTF-8: replace(), contains(), startsWith(), endsWith(), repeat() - there may be limited situations where I want to dive into the code points which make up a string, although I can't think of many: $length, pad(), indexOf(), lastIndexOf(), charAt(), replaceSlice() - remaining methods avoid me creating invalid UTF-8, but don't help me much with real-life text: chunk(), split(), substring() - I can ask what codepage my Unicode string is in; I don't even understand what this means I think an efficient OO wrapper around ICU is a great idea, but more thought needs to go into what methods are exposed, and how people are going to use them in real code. Regards, -- Rowan Collins [IMSoP] -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php
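
The byte-budget and chunking use cases listed in this message can be sketched without UString at all, using only PCRE in /u mode. All function names below are hypothetical illustrations, not proposed API; the sketch assumes valid UTF-8 input and a PCRE build with Unicode support, where \X matches one extended grapheme cluster.

```php
<?php
// Hypothetical helpers; the names are illustrative and not from the UString RFC.

/** Split valid UTF-8 into code points (each element is 1 to 4 bytes). */
function code_points(string $utf8): array
{
    return preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY) ?: [];
}

/** Split valid UTF-8 into extended grapheme clusters using PCRE's \X. */
function graphemes(string $utf8): array
{
    preg_match_all('/\X/u', $utf8, $m);
    return $m[0];
}

/** Longest prefix built from whole units whose bytes fit into $maxBytes. */
function fit_units(array $units, int $maxBytes): string
{
    $out = '';
    foreach ($units as $unit) {
        if (strlen($out) + strlen($unit) > $maxBytes) {
            break;
        }
        $out .= $unit;
    }
    return $out;
}

$text = "cafe\u{0301}s"; // "cafés" with a decomposed e-acute: e + U+0301

// Budget of 5 bytes: "cafe" is 4 bytes, U+0301 would need 2 more.
$byCodePoint = fit_units(code_points($text), 5); // "cafe": valid UTF-8, accent lost
$byGrapheme  = fit_units(graphemes($text), 5);   // "caf": the whole "é" is dropped

// Equal-sized chunks of readable characters, two graphemes per chunk.
$chunks = array_map('implode', array_chunk(graphemes($text), 2));

var_dump($byCodePoint, $byGrapheme, $chunks);
```

The difference between $byCodePoint and $byGrapheme is the diacritic-stripping problem described in the third use case.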
Re: [PHP-DEV] [RFC] UString
Morning internals,

This is just a quick note to announce my intention to ready this RFC for voting next week. I know I'm a little late maybe, I was real sick most of last week, so couldn't do anything useful. A couple of us intend to fix outstanding issues on github and those raised here, tidy the RFC and open the vote for 7. I would ask anyone interested to scan through this thread and announce concerns that are not mentioned asap.

Cheers
Joe

On Fri, Oct 24, 2014 at 3:01 PM, Chris Wright daveran...@php.net wrote:

On 24 October 2014 07:03, Joe Watkins pthre...@pthreads.org wrote:

On Thu, 2014-10-23 at 12:54 -0700, Stas Malyshev wrote:

Hi!

P.S. u() is a bad name, will break lots of code, i.e.

Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe.

/me cringes ... I wonder how much of a problem it really is, usually when we say some function name is a problem it is because of hundreds and hundreds of results on github. If it's a huge problem then we should rename it, if we have to dig around for a single project that's incompatible, or even a handful, then it's not really a problem. Cheers Joe

I can see this being something relatively common. While I personally would never do it, there are a few reasons I can think of that people *might* do it:

- Wrapper for creating <u> HTML output
- urlencode() shortcut
- (obviously) various unicode-related things

Searching on codesearch [1] revealed (amongst a few other hits on the first page) another interesting use of it in the hhvm test suite [2]. It's difficult to search for this because all the available public search engines that I know of do fuzzy matching. Sorry. This sucks, because every other option we have for this sucks.
On the bright side, anything chosen could always be aliased at the top of the file:

use function __u as u;

This also sucks, but it sucks a little bit less because the collisions are avoided - or at least, avoided in such a way that the onus is on the user - and one can still have the sane name. First-class support at the syntax level (presumably $foo = u"unicode string" since we already have $foo = b"binary string") would IMO be better and (hopefully?) a long-term goal, but I am aware that it is - and probably should be - outside the scope of the current proposal.

[1] https://searchcode.com/?q=function+u+lang%3Aphp
[2] https://github.com/facebook/hhvm/blob/master/hphp/test/slow/ext_icu/uspoof.php#L13
Re: [PHP-DEV] [RFC] UString
On Thu, 2014-10-23 at 12:54 -0700, Stas Malyshev wrote:

Hi!

P.S. u() is a bad name, will break lots of code, i.e.

Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe.

/me cringes ... I wonder how much of a problem it really is, usually when we say some function name is a problem it is because of hundreds and hundreds of results on github. If it's a huge problem then we should rename it, if we have to dig around for a single project that's incompatible, or even a handful, then it's not really a problem.

Cheers
Joe
Re: [PHP-DEV] [RFC] UString
On 24 October 2014 07:03, Joe Watkins pthre...@pthreads.org wrote:

On Thu, 2014-10-23 at 12:54 -0700, Stas Malyshev wrote:

Hi!

P.S. u() is a bad name, will break lots of code, i.e.

Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe.

/me cringes ... I wonder how much of a problem it really is, usually when we say some function name is a problem it is because of hundreds and hundreds of results on github. If it's a huge problem then we should rename it, if we have to dig around for a single project that's incompatible, or even a handful, then it's not really a problem. Cheers Joe

I can see this being something relatively common. While I personally would never do it, there are a few reasons I can think of that people *might* do it:

- Wrapper for creating <u> HTML output
- urlencode() shortcut
- (obviously) various unicode-related things

Searching on codesearch [1] revealed (amongst a few other hits on the first page) another interesting use of it in the hhvm test suite [2]. It's difficult to search for this because all the available public search engines that I know of do fuzzy matching. Sorry. This sucks, because every other option we have for this sucks.

On the bright side, anything chosen could always be aliased at the top of the file:

use function __u as u;

This also sucks, but it sucks a little bit less because the collisions are avoided - or at least, avoided in such a way that the onus is on the user - and one can still have the sane name. First-class support at the syntax level (presumably $foo = u"unicode string" since we already have $foo = b"binary string") would IMO be better and (hopefully?) a long-term goal, but I am aware that it is - and probably should be - outside the scope of the current proposal.

[1] https://searchcode.com/?q=function+u+lang%3Aphp
[2] https://github.com/facebook/hhvm/blob/master/hphp/test/slow/ext_icu/uspoof.php#L13
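
The aliasing idea in this message can be tried with `use function`, which exists since PHP 5.6. The sketch below is an illustration under stated assumptions: the Vendor\Text namespace, the body of __u(), and the Legacy helper are all made up for the example, and a real __u() would return a Unicode string object rather than a tagged byte string.

```php
<?php
// Hypothetical namespaces and function bodies; nothing here is the RFC's API.
namespace Vendor\Text {
    function __u(string $bytes): string
    {
        // Stand-in body so the example runs without the ustring extension.
        return 'u:' . $bytes;
    }
}

namespace App {
    // The alias is resolved at compile time and only applies inside this
    // namespace block, so the short name is opt-in per file.
    use function Vendor\Text\__u as u;

    echo u('hello'), "\n"; // prints "u:hello"
}

namespace Legacy {
    // An unrelated, pre-existing u() helper keeps working untouched,
    // because it never imported the alias.
    function u(string $s): string
    {
        return rawurlencode($s);
    }

    echo u('a b'), "\n"; // prints "a%20b"
}
```

This is the sense in which the onus moves to the user: each file decides what the short name u means, and a global u() defined elsewhere is never shadowed by accident.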
Re: [PHP-DEV] [RFC] UString
On Tue, 2014-10-21 at 10:30 -0700, Stas Malyshev wrote: Hi! I wish there was a way for specific objects to opt into this. There will be, if __hashKey() or whatever would be the properly bikeshedded name, becomes reality as discussed elsewhere. It shouldn't be hard to do and it's exactly what many other languages do when trying to use objects as keys for maps. Not ready for discussion yet ... https://wiki.php.net/rfc/hashkey But it exists, I think it solves a problem for ustring in particular but it solves the problem in general too. No time to write about it or discuss it at this moment, but in pipeline, hopefully ... Cheers Joe
Re: [PHP-DEV] [RFC] UString
On Tue, 2014-10-21 at 21:42 +0100, Rowan Collins wrote: On 21/10/2014 08:06, Joe Watkins wrote: Morning internalz, https://wiki.php.net/rfc/ustring This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever. Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;) Cheers Joe I think this looks like a really great start at creating something actually useful, rather than getting stuck at the drawing board. I like that the scope is quite small initially - where does the single responsibility of a class that represents a string end, anyway? :) A few opinions: 1) Global / static defaults are bad. The existence of the setDefaultCodepage method feels like an anti-pattern to me. It means libraries can't rely on this class working the same way in two different host environments, or even at two re-entries in the same program. Effectively, if you don't know what the second argument to the constructor will default to, you can't actually treat it as optional unless you're writing monolithic code. This is a common pattern in PHP, but http_build_query() would be so much more pleasant if I could safely call it with 1 argument instead of 3. I think the default should be hard-coded to UTF-8, which according to previous discussion is always the default *output* encoding, so would mean this would always work: $aUString = new UString( (string)$aUString ); Any other encoding will be dependent on, and known from, the context where the object is created - if grabbing data from an HTTP request, a header should tell them; if from a database, a connection parameter; and so on. Could be true, it feels quite horrible to me today too, I think someone else suggested it, but it might have been me. I'll look at doing something about that ... 
The only case I can see where a default encoding would be sensible would be where source code itself is in a different encoding, so that u('literal string') works as expected. I guess if we ever went down the route of special literal syntax like u'literal string', the declared source encoding could be used. Actually, the u() shortcut function appears to be missing the encoding parameter completely; is this deliberate? Fixed that. 2) Clarify relationship to a byte string Most of the API acts like this is an abstract object representing a bunch of Unicode code points. As such, I'm not sure what getCodepage() does - a code page (or more properly encoding) is a property of a stream of bytes, so has no meaning in this context, surely? The internal implementation could use UTF-8, UTF-16, or some made-up encoding (like Perl6's NFG system) and the user should never need to know (other than to understand performance implications). On the other hand, when you *do* want a stream of bytes, the class doesn't seem to have an explicit way to get one. The (currently undocumented) behaviour is apparently to spit out UTF-8 if cast to a string, but it would be nice to have an explicit function which could be passed a parameter in order to serialise to, say, UTF-16, instead. I reused the terminology used by ICU, it made sense in their documentation. So we want a ::getBytes or something like that ... I'll do that ... 3) The Grapheme Question This has been raised a few times, so I won't labour the point, just mention my current thinking. Unicode is complicated. Partly, that's because of a series of compromises in its design; but partly, it's because writing systems are complicated, and Unicode tries harder than most previous systems to acknowledge that. So, there's a tradeoff to be made between giving users what they think they need, thus hiding the messy details, and giving users the power to do things right, in a more complex way. 
There is also a namespace mess if you insist on every function and property having to declare what level of abstraction it's talking about - e.g. $codePointLength instead of $length. An idea I've been toying with is rather than having one class representing the slippery notion of a Unicode string, having (at least) two, closely tied, classes: CodePointString (roughly = UString right now) and GraphemeString (a higher level abstraction tied to the same internal representation). I intend to mock this up as a set of interfaces at some point, but the basic idea is that you could write this:

// Get an abstract object from a byte string, probably a GraphemeString, parsing the input as UTF-8
$str = u('some text');
// Perform an operation that explicitly deals in Code Points
$str = $str->asCodePoints()->normalise('NFC');
// Get information using a higher level of abstraction
$length = $str->asGraphemes()->length;
// Perform a high-level mutation, then convert right
Re: [PHP-DEV] [RFC] UString
On Tue, 2014-10-21 at 10:28 -0700, Stas Malyshev wrote:

Hi! https://wiki.php.net/rfc/ustring This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.

Couple of thoughts:

- I like the idea of having a unicode string class. May be a way to figure out the right way to do it without messing up the whole core.

- I wish there were more description of which API this class provides. If it's planned to be direct copy of UnicodeString, some of the operations there are not how PHP strings usually work (i.e. in-place modification) and it's not really enough to make it useful - e.g. what if I need to do regexps on it, for example? Or does it cover whole mbstring API too? What about something mbstring doesn't cover, like ucfirst or strrev?

API on github in readme. Regexp not covered yet, ICU has a nicer Matcher/Pattern API like Java's, I'm not sure what to do there, an ICU based API could certainly be introduced.

- Do we really need different encodings, different backends and so on, internally? Note that each backend has its own quirks, limitations and bugs, and there's nothing worse than dealing with unpredictable set of dependencies. The user cares what they send into the class and what comes out, but very rarely they care what happens inside - why not just do it one way everywhere?

No, actually, I don't think we do. It was over complicating something simple, so I removed the backend abstraction and will work towards solving the rest too. We'll use ICU, because battle tested like nothing else, and keeps everything simple ... it doesn't make sense to introduce a possibly unstable and as you rightly say different API with its own quirks.

Cheers
Joe
Re: [PHP-DEV] [RFC] UString
On Tue, 2014-10-21 at 07:49 -0700, Sara Golemon wrote:

On Oct 21, 2014, at 0:06, Joe Watkins pthre...@pthreads.org wrote: Morning internalz, https://wiki.php.net/rfc/ustring This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.

The backend abstraction seems overengineered to me. It could also lead to inconsistencies in behavior if ICU and Windows implement something in subtly different ways. Since we're linking ICU for the rest of the intl extension anyway, it seems to me like we should just focus on it as an ICU wrapper. Also, I'd propose a minor amendment to this RFC that other intl classes be extended to support taking UString instances as arguments (avoiding the implicit conversion to UTF8). That work doesn't have to gate adoption of the base implementation, it'd just be useful to decide at the same time if we want to do so. -Sara

Actually I agree, I just needed a few people to say WTF. Backend gone, we are gonna use ICU, rfc/ext updated. INTL is still an open question yeah, preference noted.

Cheers
Joe
Re: [PHP-DEV] [RFC] UString
this won't completely solve the problem, because array keys won't be UString anymore. Thanks. Dmitry. On Thu, Oct 23, 2014 at 12:11 PM, Joe Watkins pthre...@pthreads.org wrote: On Tue, 2014-10-21 at 10:30 -0700, Stas Malyshev wrote: Hi! I wish there was a way for specific objects to opt into this. There will be, if __hashKey() or whatever would be the properly bikeshedded name, becomes reality as discussed elsewhere. It shouldn't be hard to do and it's exactly what many other languages do when trying to use objects as keys for maps. Not ready for discussion yet ... https://wiki.php.net/rfc/hashkey But it exists, I think it solves a problem for ustring in particular but it solves the problem in general too. No time to write about it or discuss it at this moment, but in pipeline, hopefully ... Cheers Joe
Re: [PHP-DEV] [RFC] UString
Joe Watkins wrote on 23/10/2014 09:18:

I'd rather higher level stuff existed at a higher level, I'd rather solve for ustring the problems that are solved for normal strings and leave the rest up to whatever the framework/component/library wants to do.

It's not really higher level in terms of the problem being solved, it's the same functions applied to a higher abstraction of what "string" means. It doesn't make much sense to say that u($foo)->length solves the same problem as strlen($foo), but grapheme_strlen($foo) is somehow higher level. They're three different definitions of the word "length" which can be applied to the same string, and it would be nice if they were all accessible through the same API.

I get the feeling people are thinking of grapheme functions as something exotic and hard to implement, but ext/intl seems to have a very straightforward set of functions for them: http://php.net/manual/en/ref.intl.grapheme.php

The two-interfaces idea was just to get over the naming problem of prefixing everything with codePointX or graphemeX, and wouldn't actually require a separate data structure under the hood.

--
Rowan Collins [IMSoP]
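
The three definitions of length, and the idea of two views over one representation, can be sketched in userland PHP. This is an assumption-laden illustration: TextView and its methods are hypothetical names, it is not the interface mocked up in the thread, and it uses PCRE's /u mode and \X instead of ext/intl so that it runs on a stock PHP 7 build.

```php
<?php
// Hypothetical class: one byte string, viewed either as code points or as
// grapheme clusters. Not the UString API and not Rowan's proposed interfaces.
final class TextView
{
    private $units;

    private function __construct(array $units)
    {
        $this->units = $units;
    }

    public static function codePoints(string $utf8): self
    {
        preg_match_all('/./su', $utf8, $m);   // one match per code point
        return new self($m[0]);
    }

    public static function graphemes(string $utf8): self
    {
        preg_match_all('/\X/u', $utf8, $m);   // one match per grapheme cluster
        return new self($m[0]);
    }

    public function length(): int
    {
        return count($this->units);
    }

    public function slice(int $start, int $length): string
    {
        return implode('', array_slice($this->units, $start, $length));
    }
}

$s = "A\u{030A}ngstro\u{0308}m"; // "Ångström" written with combining marks

$bytes      = strlen($s);                        // 12
$codePoints = TextView::codePoints($s)->length(); // 10
$graphemes  = TextView::graphemes($s)->length();  // 8

// The same slice request means different things in the two views.
$firstCodePoint = TextView::codePoints($s)->slice(0, 1); // "A" without its ring
$firstGrapheme  = TextView::graphemes($s)->slice(0, 1);  // "A" plus U+030A

var_dump($bytes, $codePoints, $graphemes);
```

Both views share the method names length() and slice(), which is the naming benefit described above: the level of abstraction is chosen once, when the view is created, rather than in every method name.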
Re: [PHP-DEV] [RFC] UString
On Thu, 2014-10-23 at 12:44 +0400, Dmitry Stogov wrote:

this won't completely solve the problem, because array keys won't be UString anymore.

http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode() Others solve this problem in exactly this way, the Java implementation requires that you return an int. The one in that draft will allow you to return any scalar. This is much more suitable for PHP. It doesn't solve the problem directly but allows the programmer to solve it for themselves, just like Object.hashCode in Java.

Thanks. Dmitry. On Thu, Oct 23, 2014 at 12:11 PM, Joe Watkins pthre...@pthreads.org wrote: On Tue, 2014-10-21 at 10:30 -0700, Stas Malyshev wrote: Hi! I wish there was a way for specific objects to opt into this. There will be, if __hashKey() or whatever would be the properly bikeshedded name, becomes reality as discussed elsewhere. It shouldn't be hard to do and it's exactly what many other languages do when trying to use objects as keys for maps. Not ready for discussion yet ... https://wiki.php.net/rfc/hashkey But it exists, I think it solves a problem for ustring in particular but it solves the problem in general too. No time to write about it or discuss it at this moment, but in pipeline, hopefully ... Cheers Joe

Cheers
Joe
Re: [PHP-DEV] [RFC] UString
On 23 Oct 2014, at 09:44, Dmitry Stogov dmi...@zend.com wrote: this won't completely solve the problem, because array keys won't be UString anymore. Sure, but unless we turn arrays into SplObjectStorage that won’t change. Nobody wants to touch arrays and make them support other key types. Heck, my bigint RFC doesn’t even do that. -- Andrea Faulds http://ajf.me/
Re: [PHP-DEV] [RFC] UString
Dmitry Stogov wrote on 21/10/2014 10:01: The right approach, would be extending zend_string with encoding and then adopting near all functions working with zend_string to take encoding into account. But, of course, this is going to lead to much more complicated solution (with some slowdown). Isn't that kind of what ext/mbstring does? I think that treating Unicode as nothing more than an encoding, and trying to hide all its complexity from the user, is not particularly wise. Unicode isn't just ASCII, but bigger, so keeping the same API but making the implementation work with more characters isn't really Unicode support. For instance, what does allowing Unicode strings as array keys actually mean? We already allow pretty much any sequence of bytes as an array key, so what we're actually talking about is that array-handling functions should be somehow Unicode aware. In the case of sorting functions, that means a mechanism for selecting a collation, even if you know how the strings are encoded. There are a handful of operations which have an obvious meaning under Unicode - strtoupper(), for instance. It might be nice if those worked transparently with UStrings, but I don't think that really constitutes complete Unicode support either. I think we're going to keep going round in circles unless we can really pin down what it means for a language to support Unicode. -- Rowan Collins [IMSoP]
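
The point about sorting needing a collation, even when the encoding is known, can be seen with a four-word list. This is a sketch under assumptions: the word list and the fr_FR locale are arbitrary choices for illustration, and the Collator part only runs when ext/intl is loaded, so no particular locale-aware order is claimed here.

```php
<?php
$words = ["zebra", "\u{00E9}clair", "eclair", "apple"];

// Byte-order sort: "é" is encoded as 0xC3 0xA9 in UTF-8, and 0xC3 is greater
// than any ASCII letter, so the accented word lands after "zebra".
$byBytes = $words;
sort($byBytes, SORT_STRING);
print_r($byBytes); // apple, eclair, zebra, éclair

// Locale-aware sort needs a collation to be selected explicitly.
// Guarded, because ext/intl may not be available on every build.
if (class_exists('Collator')) {
    $byLocale = $words;
    $collator = new Collator('fr_FR'); // locale chosen only for illustration
    $collator->sort($byLocale);
    print_r($byLocale);
}
```

Every string here is valid UTF-8 and the encoding is never in doubt, yet the byte-order result is not what a reader of the list would expect; that gap is what "Unicode-aware array functions" would have to close.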
Re: [PHP-DEV] [RFC] UString
On 23 Oct 2014, at 14:44, Rowan Collins rowan.coll...@gmail.com wrote:

Dmitry Stogov wrote on 21/10/2014 10:01: The right approach, would be extending zend_string with encoding and then adopting near all functions working with zend_string to take encoding into account. But, of course, this is going to lead to much more complicated solution (with some slowdown).

Isn't that kind of what ext/mbstring does? I think that treating Unicode as nothing more than an encoding, and trying to hide all its complexity from the user, is not particularly wise. Unicode isn't just ASCII, but bigger, so keeping the same API but making the implementation work with more characters isn't really Unicode support.

I’m inclined to agree here. Having an encoding-aware zend_string vs. having a Unicode-aware string aren’t quite the same. Certain string operations are only possible for certain encodings, and by supporting any encoding we risk making things confusing. I’d rather we convert everything to Unicode.

--
Andrea Faulds http://ajf.me/
Re: [PHP-DEV] [RFC] UString
On Thu, 2014-10-23 at 11:38 +0100, Joe Watkins wrote:

It doesn't solve the problem directly but allows the programmer to solve it for themselves, just like Object.hashCode in Java.

The point is that it won't work in this way:

$a = [ $ustring => $value ];
foreach ($a as $key => $v) { $key->ustring_method(); }

but one needs something along the lines of

$a = [ $ustring => $value ];
foreach ($a as $key => $v) { UString::fromHashCode($key)->ustring_method(); }

which likely loses object identity. It works but is not really nice :-)

johannes
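
The loss of object identity described here can be reproduced in current PHP with any class that has __toString(). The Text class below is a hypothetical stand-in for a UString-like object; the explicit (string) cast plays the role that a __hashKey()-style hook would play in the draft RFC, which is an assumption about how that hook would be used.

```php
<?php
// Hypothetical stand-in for a UString-like class; only __toString() matters here.
final class Text
{
    private $bytes;

    public function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    public function __toString()
    {
        return $this->bytes;
    }
}

$name = new Text("\u{043A}\u{043B}\u{044E}\u{0447}"); // Cyrillic "ключ"

// Objects cannot be array keys, so the key has to be a scalar derived from
// the object. Writing [$name => 42] directly is rejected by PHP.
$map = [(string) $name => 42];

foreach ($map as $key => $value) {
    $keyIsString  = is_string($key);   // true: the key is a plain byte string
    $again        = new Text($key);    // re-wrap to get the methods back
    $sameContents = ($again == $name); // true: same class, same property values
    $sameObject   = ($again === $name);// false: a different object
}

var_dump($keyIsString, $sameContents, $sameObject);
```

So the round trip works, as the message says, but every iteration has to rebuild an object, and anything that depended on the original instance is gone.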
Re: [PHP-DEV] [RFC] UString
On 23 Oct 2014, at 14:53, Johannes Schlüter johan...@schlueters.de wrote:

On Thu, 2014-10-23 at 11:38 +0100, Joe Watkins wrote: It doesn't solve the problem directly but allows the programmer to solve it for themselves, just like Object.hashCode in Java.

The point is that it won't work in this way:

$a = [ $ustring => $value ];
foreach ($a as $key => $v) { $key->ustring_method(); }

but one needs something along the lines of

$a = [ $ustring => $value ];
foreach ($a as $key => $v) { UString::fromHashCode($key)->ustring_method(); }

which likely loses object identity. It works but is not really nice :-)

u($key)->split(',')->... works :)

--
Andrea Faulds http://ajf.me/
Re: [PHP-DEV] [RFC] UString
On Thu, 2014-10-23 at 14:59 +0100, Andrea Faulds wrote:

On 23 Oct 2014, at 14:53, Johannes Schlüter johan...@schlueters.de wrote:

On Thu, 2014-10-23 at 11:38 +0100, Joe Watkins wrote: It doesn't solve the problem directly but allows the programmer to solve it for themselves, just like Object.hashCode in Java.

The point is that it won't work in this way:

$a = [ $ustring => $value ];
foreach ($a as $key => $v) { $key->ustring_method(); }

but one needs something along the lines of

$a = [ $ustring => $value ];
foreach ($a as $key => $v) { UString::fromHashCode($key)->ustring_method(); }

which likely loses object identity. It works but is not really nice :-)

u($key)->split(',')->... works :)

While that's something else from the original example and makes this behave not like an integral part of the language. The proper solution would be a unicode type, but PHP 6 showed that this is not going to work out. This is way better than what we have right now, though, and a good step in the right direction. We probably might integrate it in the core language more and more. My point is to stress that this is incomplete, as Dmitry said, and that we should not take this alone as the final solution forever.

johannes

P.S. u() is a bad name, will break lots of code, i.e. https://code.openhub.net/file?fid=wRj6MYm-GPDxPidisWYoLa23wFccid=CCYlIMOwTkss=fndef%3Aupp=0fl=PHPff=1filterChecked=truefp=126888mp,=1ml=1me=1md=1projSelected=true#L0 will give weird runtime behavior as their definition is guarded by a function_exists check but both functions do completely different things.
Re: [PHP-DEV] [RFC] UString
Hi! Not ready for discussion yet ... https://wiki.php.net/rfc/hashkey Hey, I've just started my own... https://wiki.php.net/rfc/objkey I guess we should combine them :) -- Stanislav Malyshev, Software Architect SugarCRM: http://www.sugarcrm.com/
Re: [PHP-DEV] [RFC] UString
Hi! P.S. u() is a bad name, will break lots of code, i.e. Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe. -- Stanislav Malyshev, Software Architect SugarCRM: http://www.sugarcrm.com/
Re: [PHP-DEV] [RFC] UString
On 23 Oct 2014, at 20:54, Stas Malyshev smalys...@sugarcrm.com wrote:

P.S. u() is a bad name, will break lots of code, i.e. Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe.

I don't like that. This might sound crazy, but what about adding Unicode string literals to the parser, e.g. u"foo bar\u{202e}你好"? If the UString extension isn't available, just error. It wouldn't be the first time we had disableable syntax features (``), and this avoids any possible conflicts.

--
Andrea Faulds http://ajf.me/
Re: [PHP-DEV] [RFC] UString
On Thu, 2014-10-23 at 12:47 -0700, Stas Malyshev wrote: Hi! Not ready for discussion yet ... https://wiki.php.net/rfc/hashkey Hey, I've just started my own... https://wiki.php.net/rfc/objkey I guess we should combine them :) Happy to port patch already written to conform to your specification, (more or less complies, other than name) you are welcome to go ahead and do the RFC bit ? Cheers Joe
Re: [PHP-DEV] [RFC] UString
On Thu, 2014-10-23 at 12:47 -0700, Stas Malyshev wrote: Hi! Not ready for discussion yet ... https://wiki.php.net/rfc/hashkey Hey, I've just started my own... https://wiki.php.net/rfc/objkey I guess we should combine them :) Done, branch @ http://github.com/krakjoe/php-src/compare/hashkey Cheers Joe
Re: [PHP-DEV] [RFC] UString
On 21 October 2014 23:21:37 GMT+01:00, Andrea Faulds a...@ajf.me wrote:

On 21 Oct 2014, at 21:42, Rowan Collins rowan.coll...@gmail.com wrote: The only case I can see where a default encoding would be sensible would be where source code itself is in a different encoding, so that u('literal string') works as expected.

This is only a good idea if we can somehow make it file-local. Otherwise if one library uses Latin-1 and another uses UTF-8 for some reason, bang!

Yes, I used the word "declared" advisedly, because I was thinking it could take its default encoding (if we were to go down the route of special literal syntax rather than wrapper-function) from the existing declare(encoding='...') directive, rather than a global variable or setting. http://php.net/manual/en/control-structures.declare.php#control-structures.declare.encoding
Re: [PHP-DEV] [RFC] UString
On 21 October 2014 23:21:37 GMT+01:00, Andrea Faulds a...@ajf.me wrote: Make array-like indexing with [] be by code points as you may be able to do that in constant time If the internal representation is UTF8, both code point and grapheme access require traversal unless you have some additional index structure. Both can be trivialised to byte access if you have detected and stored that the string is entirely ASCII, but otherwise you will nearly always have multiple widths within one string. If the internal representation is UTF16, code point access can be accelerated for any string containing only BMP characters (no surrogate pairs). The Perl6 concept of NFG attempts to extend that advantage to grapheme access, and to points outside the BMP.
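
The traversal cost described for a UTF-8 representation can be illustrated with a small scan over lead bytes. This is a sketch only: code_point_at() and is_ascii() are hypothetical names, the code assumes well-formed UTF-8 and PHP 7.1+ (nullable return type), and it says nothing about how UString or ICU index strings internally.

```php
<?php
/**
 * Return the code point at zero-based $index, or null when out of range.
 * Hypothetical helper: it walks the bytes from the start, so the cost grows
 * with $index instead of being constant.
 */
function code_point_at(string $utf8, int $index): ?string
{
    $n = strlen($utf8);
    for ($i = 0, $seen = 0; $i < $n; $seen++) {
        $lead = ord($utf8[$i]);
        // Sequence length is determined by the lead byte (well-formed input assumed).
        if ($lead < 0x80) {
            $len = 1;
        } elseif ($lead < 0xE0) {
            $len = 2;
        } elseif ($lead < 0xF0) {
            $len = 3;
        } else {
            $len = 4;
        }
        if ($seen === $index) {
            return substr($utf8, $i, $len);
        }
        $i += $len;
    }
    return null;
}

/** The ASCII shortcut: if this holds, code point N is simply byte N. */
function is_ascii(string $bytes): bool
{
    return preg_match('/[^\x00-\x7F]/', $bytes) === 0;
}

$mixed = "a\u{00E9}\u{4F60}\u{1F600}z"; // 1-, 2-, 3- and 4-byte sequences, then ASCII

var_dump(code_point_at($mixed, 2) === "\u{4F60}"); // bool(true)
var_dump(code_point_at($mixed, 5));                // NULL
var_dump(is_ascii("plain"), is_ascii($mixed));     // bool(true) bool(false)
```

An implementation that stores the is_ascii() result once could index such strings by byte offset directly, which is the trivialisation mentioned above; every other string needs either this kind of scan or an extra index structure.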
Re: [PHP-DEV] [RFC] UString
hi,

On Tue, Oct 21, 2014 at 4:01 PM, Dmitry Stogov dmi...@zend.com wrote:

Hi Joe, As an extension it looks fine. I assume, you don't propose to use UString objects in engine and other extensions. Unfortunately, it's yet another incomplete solution.

I have to agree here. As much as I like what has been done here, having UString as part of the engine or at least main/ may help tighter integration. I am also not sure about the driver approach (have to double check it again as I stopped following it since a couple of weeks). Having UString in the core is a great thing anyway. However there is no mention whether it should be always enabled or not. I think it should be always enabled, providing the base Unicode strings features by default. Having ICU as default dependency is not really an issue imho. We discussed that with Joe in the early UString days but we did not agree. Mainly because he likes to keep UString independent, unbloated etc. I think it is possible to keep it simple and having it tightly integrated in the core. Advanced features can be done either in intl or in userland (if we can avoid having every single project doing its own unicode string class... that would keep the performance improvement along other annoying APIs differences).

It won't allow Unicode strings as array keys; concatenation using . (probably may be done), no auto-conversion from/to script/output encoding, no auto-conversion of strings coming from database extensions, etc The right approach, would be extending zend_string with encoding and then adopting near all functions working with zend_string to take encoding into account. But, of course, this is going to lead to much more complicated solution (with some slowdown).

Fully agree here too.

If we don't care about complete solution, UString proposal may make sense at least as a faster replacement of ext/mbstring.

I agree here too. For one I do care about a complete solution, for the basic Unicode features, integrated with the language.
Thanks. Dmitry. On Tue, Oct 21, 2014 at 11:06 AM, Joe Watkins pthre...@pthreads.org wrote: Morning internalz, https://wiki.php.net/rfc/ustring This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever. Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;) Cheers Joe -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php -- Pierre @pierrejoye | http://www.libgd.org -- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP-DEV] [RFC] UString
On 21 October 2014 08:06, Joe Watkins pthre...@pthreads.org wrote:
> Morning internalz,
>
> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.
>
> Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;)

Breaks nothing, faster than mbstring, seems like win/win to me.

On the flip side, implementing UString as a scalar object would be inconsistent. At time of writing, array, int, float, bool, etc have no implementation available for this.

I agree it shouldn't be a scalar object, but how about some operator overloading like the GMP object has, so that you don't have to cast to string for expected behaviour with type coercion etc.

Right now there are user-space libraries out there that cover a lot more functionality than UString. Do you need help implementing these? Do you think it would be beneficial to briefly list which areas need attention on the RFC, so they can be checked off over time?

Overall +1 on the concept.
RE: [PHP-DEV] [RFC] UString
-----Original Message-----
From: Joe Watkins [mailto:pthre...@pthreads.org]
Sent: Tuesday, October 21, 2014 10:07 AM
To: internals@lists.php.net
Subject: [PHP-DEV] [RFC] UString

> Morning internalz,
>
> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.
>
> Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;)

+1 from me. I think it's the right way to tackle Unicode.

Zeev
Re: [PHP-DEV] [RFC] UString
On Tue, 2014-10-21 at 08:40 +0100, Leigh wrote:
> On 21 October 2014 08:06, Joe Watkins pthre...@pthreads.org wrote:
>> Morning internalz,
>>
>> https://wiki.php.net/rfc/ustring
>>
>> This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.
>>
>> Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;)
>
> Breaks nothing, faster than mbstring, seems like win/win to me.
>
> On the flip side, implementing UString as a scalar object would be inconsistent. At time of writing, array, int, float, bool, etc have no implementation available for this.
>
> I agree it shouldn't be a scalar object, but how about some operator overloading like the GMP object has, so that you don't have to cast to string for expected behaviour with type coercion etc.
>
> Right now there are user-space libraries out there that cover a lot more functionality than UString. Do you need help implementing these? Do you think it would be beneficial to briefly list which areas need attention on the RFC, so they can be checked off over time?
>
> Overall +1 on the concept.

Morning Leigh,

ZEND_CONCAT is overloaded, as well as the read_dimension and cast (to string) handlers. This seems to cover everything, unless I missed something?

Cheers
Joe
Re: [PHP-DEV] [RFC] UString
On 21/10/14 08:06, Joe Watkins wrote:
> Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;)

Does this address the problem of sorting array keys using a particular language or collation?

--
Lester Caine - G8HFL
-
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Re: [PHP-DEV] [RFC] UString
On Tue, 2014-10-21 at 09:02 +0100, Lester Caine wrote:
> On 21/10/14 08:06, Joe Watkins wrote:
>> Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;)
>
> Does this address the problem of sorting array keys using a particular language or collation?

No.

Cheers
Joe
Re: [PHP-DEV] [RFC] UString
This is great, thanks for the work!

I think we should have an opinion on grapheme clusters and state it in the RFC. I do support the idea that PHP users need to handle characters in terms of graphemes. We need a core way to deal with code points of course, but things like reverse have very low value without graphemes.

toLower/toUpper also miss the Turkish specifics - or is the UString class locale dependent? Should we add toCaseFold? Where are the "i" versions of strpos, etc.? Do we want them in core PHP7? Another point we should add to the RFC.

For reference, here is my grapheme cluster aware string handling:
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/class/Patchwork/Utf8.php
and the same but Turkish variant:
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/class/Patchwork/TurkishUtf8.php

About Unicode equivalence: do all the string matching functions (contains, startsWith, etc.) handle Unicode equivalence? How do we compare two UStrings? Does the == operator handle Unicode equivalence? What is the way to go otherwise? Normalize it beforehand on our own? The RFC should cover this as well IMHO (and state that collation/sorting handling is out of scope).

Complex topic :)

Cheers,
Nicolas
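To make the code point versus grapheme distinction above concrete, here is a minimal sketch. It deliberately does not use the proposed UString API; it assumes only core PHP 7+ with PCRE's Unicode mode, and the helper names (count_code_points, count_graphemes, reverse_graphemes) are made up for illustration.

```php
<?php
// Sketch only: core PHP + PCRE Unicode mode, not the RFC's UString class.
// All function names below are hypothetical helpers.

/** Count Unicode code points in a UTF-8 string. */
function count_code_points(string $utf8): int
{
    $n = preg_match_all('/./su', $utf8);
    if ($n === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $n;
}

/** Count extended grapheme clusters (roughly, user-perceived characters). */
function count_graphemes(string $utf8): int
{
    $n = preg_match_all('/\X/u', $utf8);
    if ($n === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $n;
}

/** Reverse cluster by cluster so combining marks stay attached to their base. */
function reverse_graphemes(string $utf8): string
{
    if (preg_match_all('/\X/u', $utf8, $m) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return implode('', array_reverse($m[0]));
}

$decomposed  = "e\u{0301}"; // "e" + U+0301 COMBINING ACUTE ACCENT
$precomposed = "\u{00E9}";  // U+00E9, canonically equivalent to the above

var_dump(strlen($decomposed));            // int(3): bytes
var_dump(count_code_points($decomposed)); // int(2): code points
var_dump(count_graphemes($decomposed));   // int(1): graphemes
var_dump($decomposed === $precomposed);   // bool(false): bytewise comparison
```

The last line is the equivalence question in miniature: two canonically equivalent strings compare unequal unless something normalizes them first (for instance the intl extension's Normalizer class, if it is available).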
Re: [PHP-DEV] [RFC] UString
Hi Joe,

As an extension it looks fine. I assume you don't propose to use UString objects in the engine and other extensions.

Unfortunately, it's yet another incomplete solution. It won't allow Unicode strings as array keys; concatenation using . (probably may be done), no auto-conversion from/to script/output encoding, no auto-conversion of strings coming from database extensions, etc.

The right approach would be extending zend_string with an encoding and then adapting nearly all functions working with zend_string to take the encoding into account. But, of course, this is going to lead to a much more complicated solution (with some slowdown).

If we don't care about a complete solution, the UString proposal may make sense at least as a faster replacement of ext/mbstring.

Thanks. Dmitry.

On Tue, Oct 21, 2014 at 11:06 AM, Joe Watkins pthre...@pthreads.org wrote:
> Morning internalz,
>
> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.
>
> Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;)
>
> Cheers
> Joe
Re: [PHP-DEV] [RFC] UString
On 21 October 2014 09:01, Joe Watkins pthre...@pthreads.org wrote:
> ZEND_CONCAT is overloaded, as well as read_dimension and cast (to string) handlers. This seems to cover everything, unless I missed something ?

ZEND_CONCAT and ZEND_ASSIGN_CONCAT were my primary concerns. I didn't see any mention of these in the RFC, which is why I brought it up (maybe it should be documented there).

May not be desirable at all, but obviously with ordinary strings we can do `int + str containing int`, and if the UString object contains an int then `int + (string)ustring` will still achieve that. My thought was to make the remaining operators that don't make sense on an object implicitly cast to string before the operation takes place.

Feel free to do not want. :)
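The `int + (string)ustring` idea above can be sketched with any object that implements __toString(). The Text class below is a hypothetical stand-in, not the RFC's UString, and the example only shows the explicit cast; what happens without the cast depends on the PHP version.

```php
<?php
// Hypothetical stand-in for a string-like object; not the RFC's UString class.
final class Text
{
    private $bytes;

    public function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    public function __toString(): string
    {
        return $this->bytes;
    }
}

$count = new Text('40');

// Concatenation already works through __toString().
$label = 'n=' . $count;

// Arithmetic needs the explicit cast: the numeric string "40" is then
// coerced to an integer like any ordinary PHP string.
$sum = 2 + (string)$count;

var_dump($label); // string(4) "n=40"
var_dump($sum);   // int(42)
```

The suggestion in the message is that the cast in the `$sum` line could become implicit for operators that make no sense on an object.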
Re: [PHP-DEV] [RFC] UString
Hello

Tangentially related:

On Tuesday, October 21, 2014, Dmitry Stogov dmi...@zend.com wrote:
> It won't allow Unicode strings as array keys;

I wish there was a way for specific objects to opt into this. Using __toString() we have something that mostly behaves just like a string and can be used wherever a string is required - with the exception of array keys.

I seem to remember some earlier discussion that led to this being intentionally made impossible (and I understand why), but maybe there could be support for another magic underscore method that's called when an object is about to be put into an array as a key (or similar situations).

Philip

--
Sensational AG
Giesshübelstrasse 62c, Postfach 1966, 8021 Zürich
Tel. +41 43 544 09 60, Mobile +41 79 341 01 99
i...@sensational.ch, http://www.sensational.ch
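As a sketch of the workaround available today, an object with __toString() can be used as a key only after an explicit cast, and the resulting key is a plain byte string. The Label class is hypothetical and only stands in for a string-like object.

```php
<?php
// Hypothetical string-like object; not the RFC's UString class.
final class Label
{
    private $bytes;

    public function __construct(string $bytes)
    {
        $this->bytes = $bytes;
    }

    public function __toString(): string
    {
        return $this->bytes;
    }
}

$a = new Label("caf\u{00E9}");  // precomposed U+00E9
$b = new Label("cafe\u{0301}"); // "e" + U+0301 COMBINING ACUTE ACCENT

$seen = [];
$seen[(string)$a] = 'first';  // explicit cast: the key is the raw bytes
$seen[(string)$b] = 'second'; // looks the same on screen, different bytes

var_dump(count($seen)); // int(2)
```

Because keys are compared bytewise, the two visually identical labels land in separate slots; any opt-in mechanism for object keys would have to decide whether to normalize first.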
Re: [PHP-DEV] [RFC] UString
Hi,

@Philip: please read the discussion that happened a month ago (and follow up on it if necessary): http://marc.info/?l=php-internals&m=141145952422734&w=2

Regards,

On Tue, Oct 21, 2014 at 11:19 AM, Philip Hofstetter phofstet...@sensational.ch wrote:
> Hello
>
> Tangentially related:
>
> On Tuesday, October 21, 2014, Dmitry Stogov dmi...@zend.com wrote:
>> It won't allow Unicode strings as array keys;
>
> I wish there was a way for specific objects to opt into this. Using __toString() we have something that mostly behaves just like a string and can be used wherever a string is required - with the exception of array keys.
>
> I seem to remember some earlier discussion that led to this being intentionally made impossible (and I understand why), but maybe there could be support for another magic underscore method that's called when an object is about to be put into an array as a key (or similar situations).
>
> Philip

--
Florian Margaine
Re: [PHP-DEV] [RFC] UString
On Tue, 2014-10-21 at 13:01 +0400, Dmitry Stogov wrote:
> Hi Joe,
>
> As an extension it looks fine. I assume you don't propose to use UString objects in the engine and other extensions.

I'm not proposing it now, no.

> Unfortunately, it's yet another incomplete solution. It won't allow Unicode strings as array keys;

The engine doesn't allow that; couldn't we find a way of using objects as array keys? It doesn't seem like a limitation of the extension, to me ;)

> concatenation using . (probably may be done),

That's already done.

> no auto-conversion from/to script/output encoding,

That could be arranged.

> no auto-conversion of strings coming from database extensions, etc.

I'm not sure how important that is; it's not a big deal to create a new object, nor would it be a big deal for those extensions that need to always return Unicode strings to do so.

> The right approach would be extending zend_string with an encoding and then adapting nearly all functions working with zend_string to take the encoding into account. But, of course, this is going to lead to a much more complicated solution (with some slowdown).

That seems a lot like bashing our head against a wall. We tried to introduce support everywhere and it failed. Do we really want to step on the performance gains introduced by recent changes by making all strings Unicode? That doesn't seem like a sensible thing to want, at least right now.

Having UString doesn't stop us approaching the problem differently in the future, but it would have to be a very different future to even make sense to me.

> If we don't care about a complete solution, the UString proposal may make sense at least as a faster replacement of ext/mbstring.

As the RFC states, we are only approaching one problem, the problem that ext/mbstring is not a good API.

> Thanks. Dmitry.
>
> On Tue, Oct 21, 2014 at 11:06 AM, Joe Watkins pthre...@pthreads.org wrote:
>> Morning internalz,
>>
>> https://wiki.php.net/rfc/ustring
>>
>> This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.
>>
>> Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;)
>>
>> Cheers
>> Joe

Cheers
Joe
Re: [PHP-DEV] [RFC] UString
On Tue, Oct 21, 2014 at 1:25 PM, Joe Watkins pthre...@pthreads.org wrote:
> On Tue, 2014-10-21 at 13:01 +0400, Dmitry Stogov wrote:
>> Hi Joe,
>>
>> As an extension it looks fine. I assume you don't propose to use UString objects in the engine and other extensions.
>
> I'm not proposing it now, no.
>
>> Unfortunately, it's yet another incomplete solution. It won't allow Unicode strings as array keys;
>
> The engine doesn't allow that; couldn't we find a way of using objects as array keys? It doesn't seem like a limitation of the extension, to me ;)
>
>> concatenation using . (probably may be done),
>
> That's already done.
>
>> no auto-conversion from/to script/output encoding,
>
> That could be arranged.
>
>> no auto-conversion of strings coming from database extensions, etc.
>
> I'm not sure how important that is; it's not a big deal to create a new object, nor would it be a big deal for those extensions that need to always return Unicode strings to do so.
>
>> The right approach would be extending zend_string with an encoding and then adapting nearly all functions working with zend_string to take the encoding into account. But, of course, this is going to lead to a much more complicated solution (with some slowdown).
>
> That seems a lot like bashing our head against a wall. We tried to introduce support everywhere and it failed. Do we really want to step on the performance gains introduced by recent changes by making all strings Unicode?

Yeah :)

I'm not sure if it should be done, and I wouldn't like to work on it in the near future, but the zend_string approach should be easier to implement than the separate IS_UNICODE + IS_STRING + IS_BINARY types in PHP6.

> That doesn't seem like a sensible thing to want, at least right now.
>
> Having UString doesn't stop us approaching the problem differently in the future, but it would have to be a very different future to even make sense to me.

Agree.

>> If we don't care about a complete solution, the UString proposal may make sense at least as a faster replacement of ext/mbstring.
>
> As the RFC states, we are only approaching one problem, the problem that ext/mbstring is not a good API.

Then, it's fine.

One note regarding implementation: why do you use C++ for ustring.cpp? I understand it's necessary for the ICU backend, but if in the future you might switch to another backend (which may not require C++), why use C++ for the PHP extension part?

Thanks. Dmitry.

>> Thanks. Dmitry.
>>
>> On Tue, Oct 21, 2014 at 11:06 AM, Joe Watkins pthre...@pthreads.org wrote:
>>> Morning internalz,
>>>
>>> https://wiki.php.net/rfc/ustring
>>>
>>> This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.
>>>
>>> Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;)
>>>
>>> Cheers
>>> Joe
>
> Cheers
> Joe
Re: [PHP-DEV] [RFC] UString
On Tue, 2014-10-21 at 13:52 +0400, Dmitry Stogov wrote:
> On Tue, Oct 21, 2014 at 1:25 PM, Joe Watkins pthre...@pthreads.org wrote:
>> On Tue, 2014-10-21 at 13:01 +0400, Dmitry Stogov wrote:
>>> Hi Joe,
>>>
>>> As an extension it looks fine. I assume you don't propose to use UString objects in the engine and other extensions.
>>
>> I'm not proposing it now, no.
>>
>>> Unfortunately, it's yet another incomplete solution. It won't allow Unicode strings as array keys;
>>
>> The engine doesn't allow that; couldn't we find a way of using objects as array keys? It doesn't seem like a limitation of the extension, to me ;)
>>
>>> concatenation using . (probably may be done),
>>
>> That's already done.
>>
>>> no auto-conversion from/to script/output encoding,
>>
>> That could be arranged.
>>
>>> no auto-conversion of strings coming from database extensions, etc.
>>
>> I'm not sure how important that is; it's not a big deal to create a new object, nor would it be a big deal for those extensions that need to always return Unicode strings to do so.
>>
>>> The right approach would be extending zend_string with an encoding and then adapting nearly all functions working with zend_string to take the encoding into account. But, of course, this is going to lead to a much more complicated solution (with some slowdown).
>>
>> That seems a lot like bashing our head against a wall. We tried to introduce support everywhere and it failed. Do we really want to step on the performance gains introduced by recent changes by making all strings Unicode?
>
> Yeah :)

You must like punishment :D

> I'm not sure if it should be done, and I wouldn't like to work on it in the near future, but the zend_string approach should be easier to implement than the separate IS_UNICODE + IS_STRING + IS_BINARY types in PHP6.

The implementation might be simpler, but the effect is the same, I think. I can be wrong, but nothing has changed so drastically that it will allow us to absorb the kind of impact I think you are talking about.

>> That doesn't seem like a sensible thing to want, at least right now.
>>
>> Having UString doesn't stop us approaching the problem differently in the future, but it would have to be a very different future to even make sense to me.
>
> Agree.
>
>>> If we don't care about a complete solution, the UString proposal may make sense at least as a faster replacement of ext/mbstring.
>>
>> As the RFC states, we are only approaching one problem, the problem that ext/mbstring is not a good API.
>
> Then, it's fine.
>
> One note regarding implementation: why do you use C++ for ustring.cpp? I understand it's necessary for the ICU backend, but if in the future you might switch to another backend (which may not require C++), why use C++ for the PHP extension part?

Totally possible that we'll have to change, or that we should change. A few people have said they would like to write a backend, so we'll see what comes in and where that leads us.

> Thanks. Dmitry.
>
>>> Thanks. Dmitry.
>>>
>>> On Tue, Oct 21, 2014 at 11:06 AM, Joe Watkins pthre...@pthreads.org wrote:
>>>> Morning internalz,
>>>>
>>>> https://wiki.php.net/rfc/ustring
>>>>
>>>> This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.
>>>>
>>>> Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;)
>>>>
>>>> Cheers
>>>> Joe
>>
>> Cheers
>> Joe

Cheers
Joe
Re: [PHP-DEV] [RFC] UString
On 21/10/14 10:52, Dmitry Stogov wrote:
>> That seems a lot like bashing our head against a wall. We tried to introduce support everywhere and it fails. Do we really want to step on the performance gains introduced by recent changes by making all strings unicode ?
>
> Yeah :)
>
> I'm not sure, if it should be done, and I don't like to work on it in the nearest future, but zend_string approach should be easier to implement than separate IS_UNICODE + IS_STRING + IS_BINARY types in PHP6.

Isn't this the first discussion? If we are going down the route of keeping PHP7 as ASCII only in the core, then UString probably makes sense, but it does not address many of the areas where Unicode is really needed.

Handling Unicode content outside the core is working reasonably at the moment; it is problems such as using Unicode keys for arrays that are the main area where Unicode is needed in PHP7, and so a more embedded handling is needed, which may cut across yet another content wrapper?

--
Lester Caine - G8HFL
-
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Re: [PHP-DEV] [RFC] UString
On 21/10/2014 09:06, Joe Watkins wrote:
> Morning internalz,
>
> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.
>
> Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;)

Nice job!

However, doesn't ICU use UTF-16 by default, which is undesirable as most of the time it requires converting from and to UTF-8?

Cheers
--
Matteo Beccati
Development Consulting - http://www.beccati.com/
Re: [PHP-DEV] [RFC] UString
Lester Caine wrote (on 21/10/2014):
> If we are going down the route of keeping PHP7 as ASCII only in the core, then UString probably makes sense, but it does not address many of the areas where Unicode is really needed.

Just a quick point: most of the core is not ASCII. PHP strings are byte strings, completely divorced from any encoding. A few native functions assume ISO8859-1 (or possibly Windows CP1252), but mostly they just juggle whichever bytes you give them.

The main exception I can think of is that numbers are often handled specially, with digits and separators as defined by ASCII. But since we're talking UTF-8, that doesn't need to change.

> Handling Unicode content outside the core is working reasonably at the moment; it is problems such as using Unicode keys for arrays that are the main area where Unicode is needed in PHP7, and so a more embedded handling is needed, which may cut across yet another content wrapper?

I do think this is an important thing to consider, though. If this extension is genuinely just meant as a more modern and more performant way of doing things which mbstring and intl can already do, that needs to be clear in the way it's documented and publicised.

If this gets publicised as better Unicode support, users are naturally going to expect UString objects to start appearing in core, and in other extensions, and be disappointed that it's still just a toolbox for their own string handling.

--
Rowan Collins
[IMSoP]
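A small sketch of the "byte strings" point above, assuming core PHP 7+ only. The helper is_valid_utf8 is a made-up name, and it relies on preg_match() rejecting a malformed subject in Unicode mode.

```php
<?php
// Sketch only: shows that core string functions count and cut bytes.

/** Hypothetical helper: true if the bytes form well-formed UTF-8. */
function is_valid_utf8(string $bytes): bool
{
    // With the /u modifier, PCRE refuses a malformed subject and
    // preg_match() returns false instead of 1.
    return preg_match('//u', $bytes) === 1;
}

$word = "na\u{00EF}ve"; // "naïve": five code points, six bytes in UTF-8

var_dump(strlen($word)); // int(6): strlen() counts bytes

// substr() cuts at a byte offset, here in the middle of the two-byte "ï".
$cut = substr($word, 0, 3);

var_dump(is_valid_utf8($word)); // bool(true)
var_dump(is_valid_utf8($cut));  // bool(false)
```

Nothing in the core stops the second string from being passed around, which is the sense in which the strings are "divorced from any encoding".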
Re: [PHP-DEV] [RFC] UString
On 21/10/14 12:11, Rowan Collins wrote:
> Lester Caine wrote (on 21/10/2014):
>> If we are going down the route of keeping PHP7 as ASCII only in the core, then UString probably makes sense, but it does not address many of the areas where Unicode is really needed.
>
> Just a quick point: most of the core is not ASCII. PHP strings are byte strings, completely divorced from any encoding. A few native functions assume ISO8859-1 (or possibly Windows CP1252), but mostly they just juggle whichever bytes you give them.
>
> The main exception I can think of is that numbers are often handled specially, with digits and separators as defined by ASCII. But since we're talking UTF-8, that doesn't need to change.

Pierre had proposed restricting that to ASCII as a way of addressing the inconsistencies that arise because some areas do not currently make a distinction.

>> Handling Unicode content outside the core is working reasonably at the moment; it is problems such as using Unicode keys for arrays that are the main area where Unicode is needed in PHP7, and so a more embedded handling is needed, which may cut across yet another content wrapper?
>
> I do think this is an important thing to consider, though. If this extension is genuinely just meant as a more modern and more performant way of doing things which mbstring and intl can already do, that needs to be clear in the way it's documented and publicised.
>
> If this gets publicised as better Unicode support, users are naturally going to expect UString objects to start appearing in core, and in other extensions, and be disappointed that it's still just a toolbox for their own string handling.

This is where a proper discussion of just what we are trying to achieve is important, before discussing tangents?

--
Lester Caine - G8HFL
-
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
Re: [PHP-DEV] [RFC] UString
On 21.10.2014 at 09:06, Joe Watkins pthre...@pthreads.org wrote:
> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.

I have one concern I want to bring up: the RFC proposes a helper function u() to generate UStrings. As this is a very handy function name for all sorts of utility functions (as a matter of fact, we use it to create and sanitize URL strings to be embedded into HTML), I would assume that more than one project has a name clash there.

Maybe something like _u() could be used instead? Or do you have better alternatives for this?

PS: UString is also in the global namespace but should be less of a problem, I'd imagine.

Regards,
- Chris
Re: [PHP-DEV] [RFC] UString
On 21 October 2014 14:35, Christian Schneider cschn...@cschneid.com wrote:
> On 21.10.2014 at 09:06, Joe Watkins pthre...@pthreads.org wrote:
>> https://wiki.php.net/rfc/ustring
>>
>> This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.
>
> I have one concern I want to bring up: the RFC proposes a helper function u() to generate UStrings. As this is a very handy function name for all sorts of utility functions (as a matter of fact, we use it to create and sanitize URL strings to be embedded into HTML), I would assume that more than one project has a name clash there.
>
> Maybe something like _u() could be used instead? Or do you have better alternatives for this?
>
> PS: UString is also in the global namespace but should be less of a problem, I'd imagine.

With the `use function` support, that could be located in a namespace.

But something else: wasn't there a big concern in another thread regarding codepoint/grapheme support, like with $ustring->length()?

--
Regards,
Mike
Re: [PHP-DEV] [RFC] UString
On 21 Oct 2014, at 13:35, Christian Schneider cschn...@cschneid.com wrote:
> I have one concern I want to bring up: the RFC proposes a helper function u() to generate UStrings. As this is a very handy function name for all sorts of utility functions (as a matter of fact, we use it to create and sanitize URL strings to be embedded into HTML), I would assume that more than one project has a name clash there.
>
> Maybe something like _u() could be used instead? Or do you have better alternatives for this?
>
> PS: UString is also in the global namespace but should be less of a problem, I'd imagine.

I think we should reserve some way to do Unicode strings. I'd want u"foo", but we're not adding literals, so u("foo") it is.

Also, bear in mind that namespaces mean you can still have your own u() if it's in your namespace (\u).

--
Andrea Faulds
http://ajf.me/
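The namespace argument above can be sketched as follows. The namespace Acme\Html and the body of u() are assumptions made up for illustration; the point is only that an unqualified call inside a namespace resolves to that namespace's function before falling back to a global one.

```php
<?php
namespace Acme\Html; // hypothetical project namespace

/**
 * Project-local u(): escapes a URL for embedding into HTML.
 * It lives in Acme\Html, so it would not collide with a global u().
 */
function u(string $url): string
{
    return htmlspecialchars($url, ENT_QUOTES, 'UTF-8');
}

// The unqualified call resolves to Acme\Html\u() first.
// A global helper with the same name would be reached as \u().
echo u('https://example.com/?a=1&b=2'), "\n"; // https://example.com/?a=1&amp;b=2
```

Note that this only helps code that is already namespaced; a project that defines u() in the global namespace, as in the concern being answered, would still have to move or rename it.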
Re: [PHP-DEV] [RFC] UString
So, one thing which I think is worth bringing up is code points vs. characters/graphemes. This came up in another recent thread about Unicode on internals. While code-point manipulation is all well and good, we also need grapheme manipulation functions. Could we add these? That would make the API more useful.

On that note, ->charAt ought to be ->codepointAt to avoid being misleading.

--
Andrea Faulds
http://ajf.me/
Re: [PHP-DEV] [RFC] UString
On 21/10/2014 15:17, Lester Caine wrote:
> On 21/10/14 11:50, Matteo Beccati wrote:
>> However, doesn't ICU use UTF-16 by default which is undesirable as most of the times it requires converting from and to UTF-8?
>
> http://userguide.icu-project.org/strings/utf-8
>
> It is interesting that the earlier adoption of UTF-16 still prevails, but switching to UTF-8 is becoming the norm?

Yes, as far as I know, using UTF-8 by default is a compile-time option for ICU, which most of the time comes from system packages.

Cheers
--
Matteo Beccati
Development Consulting - http://www.beccati.com/
Re: [PHP-DEV] [RFC] UString
On Oct 21, 2014, at 0:06, Joe Watkins pthre...@pthreads.org wrote:
> Morning internalz,
>
> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.

The backend abstraction seems overengineered to me. It could also lead to inconsistencies in behavior if ICU and Windows implement something in subtly different ways. Since we're linking ICU for the rest of the intl extension anyway, it seems to me like we should just focus on it as an ICU wrapper.

Also, I'd propose a minor amendment to this RFC: that other intl classes be extended to support taking UString instances as arguments (avoiding the implicit conversion to UTF-8). That work doesn't have to gate adoption of the base implementation; it'd just be useful to decide at the same time whether we want to do so.

-Sara
Re: [PHP-DEV] [RFC] UString
Hi!

> https://wiki.php.net/rfc/ustring
>
> This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever.

Couple of thoughts:

- I like the idea of having a Unicode string class. It may be a way to figure out the right way to do it without messing up the whole core.

- I wish there were more description of which API this class provides. If it's planned to be a direct copy of UnicodeString, some of the operations there are not how PHP strings usually work (i.e. in-place modification) and it's not really enough to make it useful - e.g. what if I need to do regexps on it? Or does it cover the whole mbstring API too? What about something mbstring doesn't cover, like ucfirst or strrev?

- Do we really need different encodings, different backends and so on, internally? Note that each backend has its own quirks, limitations and bugs, and there's nothing worse than dealing with an unpredictable set of dependencies. The user cares what they send into the class and what comes out, but they very rarely care what happens inside - why not just do it one way everywhere?

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
Re: [PHP-DEV] [RFC] UString
Hi!

I wish there was a way for specific objects to opt into this.

There will be, if __hashKey() (or whatever the properly bikeshedded name turns out to be) becomes reality as discussed elsewhere. It shouldn't be hard to do and it's exactly what many other languages do when trying to use objects as keys for maps.

-- Stanislav Malyshev, Software Architect SugarCRM: http://www.sugarcrm.com/
Re: [PHP-DEV] [RFC] UString
Hi!

Just a quick point: most of the core is not ASCII. PHP strings are byte strings, completely divorced from any encoding. A few native functions assume ISO8859-1 (or possibly Windows CP1252), but mostly they just juggle whichever bytes you give them.

True, but not all extensions and functions behave this way. Some (especially with intl, but not only) assume it's UTF-8, for example, and for some UTF-8 is a changeable default, which in practice often becomes the encoding used, since people are not aware of the need to track their encoding and most of them do use UTF-8 anyway.

The main exception I can think of is that numbers are often handled specially, with digits and separators as defined by ASCII. But since we're talking UTF-8, that doesn't need to change.

A more interesting case actually is, well, case conversion. We unknowingly used locale-dependent lowercasing routines until the inevitable encounter with the dreaded Turkish 'i'. At which point we switched to forced ASCII. So identifiers in the engine are kind of assumed to be ASCII, even though you can sometimes sneak non-ASCII past it and it will work, but weirdly.

-- Stanislav Malyshev, Software Architect SugarCRM: http://www.sugarcrm.com/
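The Turkish 'i' problem mentioned above can be sketched without touching the system locale. In the snippet below the Turkish casing rule is hard-coded (an assumption made purely for illustration, as are the function names ascii_lower() and turkish_lower() and the toy symbol table), so the output does not depend on which locales are installed. It only shows why a case-insensitive lookup keyed on lower-cased names breaks once lowercasing becomes locale-dependent; it is not how the engine is implemented.

```php
<?php
// Illustration only: the Turkish rule is hard-coded rather than obtained
// from setlocale(), so the result does not depend on installed locales.

function ascii_lower(string $s): string
{
    // Locale-independent: only the bytes A-Z are mapped.
    return strtr($s, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

function turkish_lower(string $s): string
{
    // Turkish casing: 'I' lower-cases to dotless U+0131,
    // and dotted U+0130 lower-cases to 'i'.
    $s = strtr($s, ['I' => "\u{0131}", "\u{0130}" => 'i']);
    return ascii_lower($s);
}

// A toy case-insensitive symbol table keyed by lower-cased name.
$functions = ['phpinfo' => true];

var_dump(isset($functions[ascii_lower('PHPINFO')]));   // bool(true)
var_dump(isset($functions[turkish_lower('PHPINFO')])); // bool(false)
```

The second lookup fails because the key becomes "php" followed by U+0131 and "nfo", which is a different byte string from "phpinfo".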
Re: [PHP-DEV] [RFC] UString
On 21/10/2014 08:06, Joe Watkins wrote:

Morning internalz, https://wiki.php.net/rfc/ustring This is the result of work done by a few of us, we won't be opening any vote in a fortnight. We have a long time before 7, there is no rush whatever. Now seems like a good time to start the conversation so we can hash out the details, or get on with other things ;) Cheers Joe

I think this looks like a really great start at creating something actually useful, rather than getting stuck at the drawing board. I like that the scope is quite small initially - where does the single responsibility of a class that represents a string end, anyway? :)

A few opinions:

1) Global / static defaults are bad.

The existence of the setDefaultCodepage method feels like an anti-pattern to me. It means libraries can't rely on this class working the same way in two different host environments, or even at two re-entries in the same program. Effectively, if you don't know what the second argument to the constructor will default to, you can't actually treat it as optional unless you're writing monolithic code. This is a common pattern in PHP, but http_build_query() would be so much more pleasant if I could safely call it with 1 argument instead of 3.

I think the default should be hard-coded to UTF-8, which according to previous discussion is always the default *output* encoding, so would mean this would always work:

    $aUString = new UString( (string)$aUString );

Any other encoding will be dependent on, and known from, the context where the object is created - if grabbing data from an HTTP request, a header should tell them; if from a database, a connection parameter; and so on.

The only case I can see where a default encoding would be sensible would be where source code itself is in a different encoding, so that u('literal string') works as expected. I guess if we ever went down the route of special literal syntax like u'literal string', the declared source encoding could be used.

Actually, the u() shortcut function appears to be missing the encoding parameter completely; is this deliberate?

2) Clarify relationship to a byte string

Most of the API acts like this is an abstract object representing a bunch of Unicode code points. As such, I'm not sure what getCodepage() does - a code page (or more properly encoding) is a property of a stream of bytes, so has no meaning in this context, surely? The internal implementation could use UTF-8, UTF-16, or some made-up encoding (like Perl6's NFG system) and the user should never need to know (other than to understand performance implications).

On the other hand, when you *do* want a stream of bytes, the class doesn't seem to have an explicit way to get one. The (currently undocumented) behaviour is apparently to spit out UTF-8 if cast to a string, but it would be nice to have an explicit function which could be passed a parameter in order to serialise to, say, UTF-16, instead.

3) The Grapheme Question

This has been raised a few times, so I won't labour the point, just mention my current thinking.

Unicode is complicated. Partly, that's because of a series of compromises in its design; but partly, it's because writing systems are complicated, and Unicode tries harder than most previous systems to acknowledge that. So, there's a tradeoff to be made between giving users what they think they need, thus hiding the messy details, and giving users the power to do things right, in a more complex way.

There is also a namespace mess if you insist on every function and property having to declare what level of abstraction it's talking about - e.g. $codePointLength instead of $length.

An idea I've been toying with is rather than having one class representing the slippery notion of a Unicode string, having (at least) two, closely tied, classes: CodePointString (roughly = UString right now) and GraphemeString (a higher level abstraction tied to the same internal representation).

I intend to mock this up as a set of interfaces at some point, but the basic idea is that you could write this:

    // Get an abstract object from a byte string, probably a GraphemeString, parsing the input as UTF-8
    $str = u('some text');
    // Perform an operation that explicitly deals in Code Points
    $str = $str->asCodePoints()->normalise('NFC');
    // Get information using a higher level of abstraction
    $length = $str->asGraphemes()->length;
    // Perform a high-level mutation, then convert right back to a concrete string of bytes
    echo $str->asGraphemes()->reverse()->asByteString('UTF-16');

Calling asGraphemes() on a GraphemeString or asCodePoints() on a CodePointString would be legal but a no-op, so it would be safe to accept both as input to a function, then switch to whichever level the task required.

I'm not sure if this finds a good balance between complexity and user-friendliness, and would welcome anyone's thoughts.

-- Rowan Collins
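The three levels discussed in this message (bytes, code points, graphemes) can already be observed with existing functions, which may help when weighing the proposed split. The following is a rough sketch using ext/mbstring and ext/intl rather than the UString class; reverse_graphemes() is a name invented for this example, and UTF-16BE is chosen only so that the hex output has a fixed byte order.

```php
<?php
// Sketch with existing extensions (mbstring, intl); not the RFC's API.

$s = "e\u{0301}";                    // 'e' followed by COMBINING ACUTE ACCENT

echo strlen($s), "\n";               // 3 (UTF-8 bytes)
echo mb_strlen($s, 'UTF-8'), "\n";   // 2 (code points)
echo grapheme_strlen($s), "\n";      // 1 (grapheme clusters)

// Code-point level operation: NFC composes the pair into U+00E9.
$nfc = Normalizer::normalize($s, Normalizer::FORM_C);
echo mb_strlen($nfc, 'UTF-8'), "\n"; // 1

// Explicit serialisation to a chosen byte encoding.
echo bin2hex(mb_convert_encoding($nfc, 'UTF-16BE', 'UTF-8')), "\n"; // 00e9

// Hypothetical helper: reverse by grapheme cluster so that combining
// marks stay attached to their base character.
function reverse_graphemes(string $s): string
{
    $out = '';
    for ($i = grapheme_strlen($s) - 1; $i >= 0; $i--) {
        $out .= grapheme_substr($s, $i, 1);
    }
    return $out;
}

$word = "ae\u{0301}b";
var_dump(reverse_graphemes($word) === "be\u{0301}a");  // bool(true)

// Reversing by code point moves the accent onto the 'b' instead.
$byCodePoint = implode('', array_reverse(mb_str_split($word, 1, 'UTF-8')));
var_dump($byCodePoint === "b\u{0301}ea");              // bool(true)
```

The last two checks are the practical argument for making the level of abstraction explicit: the same "reverse" request yields different text depending on the unit chosen.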
Re: [PHP-DEV] [RFC] UString
On 21 Oct 2014, at 21:42, Rowan Collins rowan.coll...@gmail.com wrote:

The only case I can see where a default encoding would be sensible would be where source code itself is in a different encoding, so that u('literal string') works as expected.

This is only a good idea if we can somehow make it file-local. Otherwise if one library uses Latin-1 and another uses UTF-8 for some reason, bang!

2) Clarify relationship to a byte string

Most of the API acts like this is an abstract object representing a bunch of Unicode code points. As such, I'm not sure what getCodepage() does - a code page (or more properly encoding) is a property of a stream of bytes, so has no meaning in this context, surely? The internal implementation could use UTF-8, UTF-16, or some made-up encoding (like Perl6's NFG system) and the user should never need to know (other than to understand performance implications).

On the other hand, when you *do* want a stream of bytes, the class doesn't seem to have an explicit way to get one. The (currently undocumented) behaviour is apparently to spit out UTF-8 if cast to a string, but it would be nice to have an explicit function which could be passed a parameter in order to serialise to, say, UTF-16, instead.

I agree on both these points. ->toBytes or ->encode with an explicit charset parameter would be good. I don’t see the point of getCodepage().

3) The Grapheme Question

This has been raised a few times, so I won't labour the point, just mention my current thinking.

Unicode is complicated. Partly, that's because of a series of compromises in its design; but partly, it's because writing systems are complicated, and Unicode tries harder than most previous systems to acknowledge that. So, there's a tradeoff to be made between giving users what they think they need, thus hiding the messy details, and giving users the power to do things right, in a more complex way.

There is also a namespace mess if you insist on every function and property having to declare what level of abstraction it's talking about - e.g. $codePointLength instead of $length.

An idea I've been toying with is rather than having one class representing the slippery notion of a Unicode string, having (at least) two, closely tied, classes: CodePointString (roughly = UString right now) and GraphemeString (a higher level abstraction tied to the same internal representation).

I intend to mock this up as a set of interfaces at some point, but the basic idea is that you could write this:

    // Get an abstract object from a byte string, probably a GraphemeString, parsing the input as UTF-8
    $str = u('some text');
    // Perform an operation that explicitly deals in Code Points
    $str = $str->asCodePoints()->normalise('NFC');
    // Get information using a higher level of abstraction
    $length = $str->asGraphemes()->length;
    // Perform a high-level mutation, then convert right back to a concrete string of bytes
    echo $str->asGraphemes()->reverse()->asByteString('UTF-16');

Calling asGraphemes() on a GraphemeString or asCodePoints() on a CodePointString would be legal but a no-op, so it would be safe to accept both as input to a function, then switch to whichever level the task required.

I'm not sure if this finds a good balance between complexity and user-friendliness, and would welcome anyone's thoughts.

I’d rather have some grapheme-specific functions and some code point functions on the same class. Make array-like indexing with [] be by code points as you may be able to do that in constant time, and because there might be multiple approaches to choosing graphemes. Have ->codepointAt(), but also ->nthGrapheme() or something like it. Not every function needs a grapheme version, though some would. Though your approach has its own merits.

-- Andrea Faulds http://ajf.me/
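For comparison with the two-class idea, the single-class alternative suggested in this reply might look roughly like the sketch below. The class name Text is invented for the example; codepointAt() and nthGrapheme() follow the names floated in the message but their signatures are guesses. The sketch assumes UTF-8 input, PHP 8.0 or later, ext/mbstring and ext/intl, and it pre-splits the string into an array so that [] indexing by code point is a constant-time lookup.

```php
<?php
// Hypothetical single-class sketch; not the RFC's API.

final class Text implements ArrayAccess
{
    /** @var string[] one element per code point */
    private array $codePoints;
    private string $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
        $this->codePoints = mb_str_split($utf8, 1, 'UTF-8');
    }

    // Code-point access: a plain array lookup.
    public function codepointAt(int $i): string
    {
        return $this->codePoints[$i] ?? '';
    }

    // Grapheme access: delegated to intl, which scans the string.
    public function nthGrapheme(int $n): string
    {
        $g = grapheme_substr($this->utf8, $n, 1);
        return $g === false ? '' : $g;
    }

    public function offsetExists(mixed $offset): bool
    {
        return isset($this->codePoints[$offset]);
    }

    public function offsetGet(mixed $offset): mixed
    {
        return $this->codePoints[$offset] ?? null;
    }

    public function offsetSet(mixed $offset, mixed $value): void
    {
        throw new LogicException('Text is immutable in this sketch');
    }

    public function offsetUnset(mixed $offset): void
    {
        throw new LogicException('Text is immutable in this sketch');
    }
}

$t = new Text("ae\u{0301}b");
echo bin2hex($t[1]), "\n";               // 65 ('e' on its own)
echo bin2hex($t->nthGrapheme(1)), "\n";  // 65cc81 ('e' plus the accent)
```

The trade-off is visible in the method names: the caller has to pick the unit at every call site, whereas the two-class design picks it once through asCodePoints() or asGraphemes().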