Re: [PHP-DEV] [RFC] UString

2015-07-02 Thread Andreas Heigl
Hi.

On 02.07.15 at 15:43, Ivan Enderlin@Hoa wrote:
 Hello :-),
 
 Just a small detail. Please, choose another name. The `Hoa\String`
 https://packagist.org/packages/hoa/string library has been renamed to
 `Hoa\Ustring` because of PHP7. So, please, don't force us to rename the
 library again ;-).

What's the issue with the name?

As far as I see it, there's no problem at all, as there's UString and
then there's Hoa\UString. Different namespace, no issue.

Or am I missing something?

Cheers

Andreas

 
 Moreover, this library provides an API that is useful for daily use and
 can be inspiring. Please, see
 http://hoa-project.net/Literature/Hack/Ustring.html.
 
 Regards.
 
 On 01/07/15 01:30, Sara Golemon wrote:
 On Mon, Mar 2, 2015 at 12:48 AM, Nikita Popov nikita@gmail.com
 wrote:
 On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org
 wrote:
  https://wiki.php.net/rfc/ustring

  This is the result of work done by a few of us, we won't be
 opening any
 vote in a fortnight. We have a long time before 7, there is no rush
 whatever.

  Now seems like a good time to start the conversation so we can
 hash out
 the details, or get on with other things ;)

 Curious what the current state of the UString RFC is.  I've got a
 functionality request for HHVM to wrap icu::UnicodeString and was
 hoping to match PHP behavior if any plans had been made, and lo...
 here's a plan!

 I'm not totally convinced by this proposal. We already have quite a
 number
 of extensions that deal with unicode text in one way or another (at
 least
 intl, mbstring and iconv). This adds yet another way of dealing with
 this
 issue - a way that will have to be combined with at least two other
 extensions (mbstring or iconv for input handling and conversion) and
 intl
 for any non-trivial operations. There's nothing wrong with adding
 another
 approach for unicode handling per se, but I'd like to have more
 emphasis on
 how this integrates with existing functionality and why it is
 implemented
 separately from it (especially intl), etc.

 I think (hope) that Joe's intention was to make it as an extension for
 proof of concept, but make it part of the intl extension when it comes
 to full adoption by the runtime.  If not, let's talk about making that
 the intent, because intl is where this belongs.

 For my bikeshedding part, I'd recommend against the u() function
 helper as it pollutes the global function namespace and takes a very
 fundamental name.  intl\u() might be worth considering, but we'll need
 to have a discussion about namespacing for the intl extension as a
 whole (separate topic).

 I'd also recommend IntlString rather than UString as nearly all
 the Intl classes follow this convention.  The one notable exception
 being UConverter (which yes, I added, and I regret the departure in
 naming).

 Otherwise, while there's room to quibble about specific API names and
 arguments, the general concept seems a no-brainer.

 -Sara

 
 


-- 
  ,,,
 (o o)
+-ooO-(_)-Ooo-+
| Andreas Heigl   |
| mailto:andr...@heigl.org  N 50°22'59.5 E 08°23'58 |
| http://andreas.heigl.org   http://hei.gl/wiFKy7 |
+-+
| http://hei.gl/root-ca   |
+-+





Re: [PHP-DEV] [RFC] UString

2015-07-02 Thread Ivan Enderlin@Hoa

I fear it will be a reserved keyword.

On 02/07/15 15:46, Andreas Heigl wrote:

Hi.

On 02.07.15 at 15:43, Ivan Enderlin@Hoa wrote:

Hello :-),

Just a small detail. Please, choose another name. The `Hoa\String`
https://packagist.org/packages/hoa/string library has been renamed to
`Hoa\Ustring` because of PHP7. So, please, don't force us to rename the
library again ;-).

What's the issue with the name?

As far as I see it, there's no problem at all, as there's UString and
then there's Hoa\UString. Different namespace, no issue.

Or am I missing something?

Cheers

Andreas


Moreover, this library provides an API that is useful for daily use and
can be inspiring. Please, see
http://hoa-project.net/Literature/Hack/Ustring.html.

Regards.

On 01/07/15 01:30, Sara Golemon wrote:

On Mon, Mar 2, 2015 at 12:48 AM, Nikita Popov nikita@gmail.com
wrote:

On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org
wrote:

  https://wiki.php.net/rfc/ustring

  This is the result of work done by a few of us, we won't be
opening any
vote in a fortnight. We have a long time before 7, there is no rush
whatever.

  Now seems like a good time to start the conversation so we can
hash out
the details, or get on with other things ;)


Curious what the current state of the UString RFC is.  I've got a
functionality request for HHVM to wrap icu::UnicodeString and was
hoping to match PHP behavior if any plans had been made, and lo...
here's a plan!


I'm not totally convinced by this proposal. We already have quite a
number
of extensions that deal with unicode text in one way or another (at
least
intl, mbstring and iconv). This adds yet another way of dealing with
this
issue - a way that will have to be combined with at least two other
extensions (mbstring or iconv for input handling and conversion) and
intl
for any non-trivial operations. There's nothing wrong with adding
another
approach for unicode handling per se, but I'd like to have more
emphasis on
how this integrates with existing functionality and why it is
implemented
separately from it (especially intl), etc.


I think (hope) that Joe's intention was to make it as an extension for
proof of concept, but make it part of the intl extension when it comes
to full adoption by the runtime.  If not, let's talk about making that
the intent, because intl is where this belongs.

For my bikeshedding part, I'd recommend against the u() function
helper as it pollutes the global function namespace and takes a very
fundamental name.  intl\u() might be worth considering, but we'll need
to have a discussion about namespacing for the intl extension as a
whole (separate topic).

I'd also recommend IntlString rather than UString as nearly all
the Intl classes follow this convention.  The one notable exception
being UConverter (which yes, I added, and I regret the departure in
naming).

Otherwise, while there's room to quibble about specific API names and
arguments, the general concept seems a no-brainer.

-Sara








--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] UString

2015-07-02 Thread Ivan Enderlin@Hoa

Hello :-),

Just a small detail. Please, choose another name. The `Hoa\String` 
https://packagist.org/packages/hoa/string library has been renamed to 
`Hoa\Ustring` because of PHP7. So, please, don't force us to rename the 
library again ;-).


Moreover, this library provides an API that is useful for daily use and 
can be inspiring. Please, see 
http://hoa-project.net/Literature/Hack/Ustring.html.


Regards.

On 01/07/15 01:30, Sara Golemon wrote:

On Mon, Mar 2, 2015 at 12:48 AM, Nikita Popov nikita@gmail.com wrote:

On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org wrote:

 https://wiki.php.net/rfc/ustring

 This is the result of work done by a few of us, we won't be
opening any
vote in a fortnight. We have a long time before 7, there is no rush
whatever.

 Now seems like a good time to start the conversation so we can
hash out
the details, or get on with other things ;)


Curious what the current state of the UString RFC is.  I've got a
functionality request for HHVM to wrap icu::UnicodeString and was
hoping to match PHP behavior if any plans had been made, and lo...
here's a plan!


I'm not totally convinced by this proposal. We already have quite a number
of extensions that deal with unicode text in one way or another (at least
intl, mbstring and iconv). This adds yet another way of dealing with this
issue - a way that will have to be combined with at least two other
extensions (mbstring or iconv for input handling and conversion) and intl
for any non-trivial operations. There's nothing wrong with adding another
approach for unicode handling per se, but I'd like to have more emphasis on
how this integrates with existing functionality and why it is implemented
separately from it (especially intl), etc.


I think (hope) that Joe's intention was to make it as an extension for
proof of concept, but make it part of the intl extension when it comes
to full adoption by the runtime.  If not, let's talk about making that
the intent, because intl is where this belongs.

For my bikeshedding part, I'd recommend against the u() function
helper as it pollutes the global function namespace and takes a very
fundamental name.  intl\u() might be worth considering, but we'll need
to have a discussion about namespacing for the intl extension as a
whole (separate topic).

I'd also recommend IntlString rather than UString as nearly all
the Intl classes follow this convention.  The one notable exception
being UConverter (which yes, I added, and I regret the departure in
naming).

Otherwise, while there's room to quibble about specific API names and
arguments, the general concept seems a no-brainer.

-Sara







Re: [PHP-DEV] [RFC] UString

2015-07-02 Thread Kalle Sommer Nielsen
Hi Ivan

2015-07-02 15:48 GMT+02:00 Ivan Enderlin@Hoa ivan.ender...@hoa-project.net:
 I fear it will be a reserved keyword.

Internally defined classes, such as UConverter or stdClass are not
reserved keywords, they are not an actual part of the language but a
part of the library. Code like the one below is perfectly valid,
meaning the example you made will continue to work as long it remains
within a namespace:

C:\dev\php-src>php -r "namespace stdlib; class stdclass { } var_dump(get_class(new stdclass), get_class(new \stdClass));"
string(15) "stdlib\stdclass"
string(8) "stdClass"


-- 
regards,

Kalle Sommer Nielsen
ka...@php.net




Re: [PHP-DEV] [RFC] UString

2015-07-02 Thread Sara Golemon
On Thu, Jul 2, 2015 at 6:43 AM, Ivan Enderlin@Hoa
ivan.ender...@hoa-project.net wrote:
 Just a small detail. Please, choose another name. The `Hoa\String`
 https://packagist.org/packages/hoa/string library has been renamed to
 `Hoa\Ustring` because of PHP7. So, please, don't force us to rename the
 library again ;-).

As replied by others, no need for concern on that front.  As \UString
and Hoa\UString can live side-by-side.

However, I would like to bump my earlier suggestion to go with
IntlString and make this functionality be part of the intl
extension.

 I'd also recommend IntlString rather than UString as nearly all
 the Intl classes follow this convention.  The one notable exception
 being UConverter (which yes, I added, and I regret the departure in
 naming).


-Sara




Re: [PHP-DEV] [RFC] UString

2015-07-01 Thread Sara Golemon
On Tue, Jun 30, 2015 at 10:36 PM, Joe Watkins pthre...@pthreads.org wrote:
 Another possible issue is engine integration:

 $string = (UString) $someString;
 $string = (UString) "someString";

That sounds like a cool idea to discuss as a completely separate,
unrelated RFC, and not specific to UString.

e.g.   $obj = (ClassName)$arg;   /* turns into */ $obj = new ClassName($arg);

So you could use casting with any class which supports single-argument
constructors.

But again, orthogonal to this RFC.

-Sara




Re: [PHP-DEV] [RFC] UString

2015-07-01 Thread Aaron Piotrowski

 On Jul 1, 2015, at 1:06 PM, Sara Golemon poll...@php.net wrote:
 
 On Tue, Jun 30, 2015 at 10:36 PM, Joe Watkins pthre...@pthreads.org wrote:
 Another possible issue is engine integration:
 
$string = (UString) $someString;
$string = (UString) "someString";
 
 That sounds like a cool idea to discuss as a completely separate,
 unrelated RFC, and not specific to UString.
 
 e.g.   $obj = (ClassName)$arg;   /* turns into */ $obj = new ClassName($arg);
 
 So you could use casting with any class which supports single-argument
 constructors.
 
 But again, orthogonal to this RFC.
 
 -Sara
 
 

Expanding on this idea, a separate RFC could propose a magic __cast($value) 
static method that would be called for code like below:

$obj = (ClassName) $scalarOrObject; // Invokes 
ClassName::__cast($scalarOrObject);

This would allow UString to implement casting a string to a UString and allow 
users to implement such behavior with their own classes.

However, I would not implement such casting syntax for UString only. Being able 
to write $ustring = (UString) $string; without the ability to do so for other 
classes would be unusual and confusing in my opinion. If an RFC adding such 
behavior was implemented, UString could be updated to support casting.

Obviously a UString should be able to be cast to a scalar string using (string) 
$ustring. If performance is a concern, UString::__toString() should cache the 
result so multiple casts to the same object are quick.
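
A rough illustration of the two ideas above, not code from the RFC. PHP has no (ClassName) cast syntax, so the proposed hook is called explicitly here. SketchString and castFrom() are made-up names standing in for UString and the suggested __cast(), and the character splitting is only an assumption about how such a class might store its data:

```php
<?php
// Hypothetical sketch: SketchString and castFrom() are illustrative names,
// not part of the UString RFC or of PHP itself.
final class SketchString
{
    private $chars;          // list of UTF-8 characters
    private $cached = null;  // memoised __toString() result

    private function __construct(array $chars)
    {
        $this->chars = $chars;
    }

    // Stand-in for the proposed static __cast($value): build an object
    // from the argument, or return null when it cannot be converted.
    public static function castFrom($value)
    {
        if ($value instanceof self) {
            return $value;
        }
        if (!is_string($value)) {
            return null;
        }
        // With the /u modifier, preg_split() returns false on invalid UTF-8.
        $chars = preg_split('//u', $value, -1, PREG_SPLIT_NO_EMPTY);
        return $chars === false ? null : new self($chars);
    }

    // Cache the scalar form so repeated (string) casts stay cheap.
    public function __toString()
    {
        if ($this->cached === null) {
            $this->cached = implode('', $this->chars);
        }
        return $this->cached;
    }
}

$s = SketchString::castFrom('abc');
var_dump((string) $s);                 // string(3) "abc"
var_dump(SketchString::castFrom(42));  // NULL
```

In the proposal, a null result would make the engine raise an error instead of handing null back to the caller.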

Aaron Piotrowski



RE: [PHP-DEV] [RFC] UString

2015-07-01 Thread Anatol Belski
Hi,

 -Original Message-
 From: Aaron Piotrowski [mailto:aa...@icicle.io]
 Sent: Wednesday, July 1, 2015 9:00 PM
 To: Sara Golemon
 Cc: pthre...@pthreads.org; internals@lists.php.net
 Subject: Re: [PHP-DEV] [RFC] UString
 
 
  On Jul 1, 2015, at 1:06 PM, Sara Golemon poll...@php.net wrote:
 
  On Tue, Jun 30, 2015 at 10:36 PM, Joe Watkins pthre...@pthreads.org
 wrote:
  Another possible issue is engine integration:
 
 $string = (UString) $someString;
 $string = (UString) "someString";
 
  That sounds like a cool idea to discuss as a completely separate,
  unrelated RFC, and not specific to UString.
 
  e.g.   $obj = (ClassName)$arg;   /* turns into */ $obj = new
ClassName($arg);
 
  So you could use casting with any class which supports single-argument
  constructors.
 
  But again, orthogonal to this RFC.
 
  -Sara
 
 
 
 Expanding on this idea, a separate RFC could propose a magic
__cast($value)
 static method that would be called for code like below:
 
 $obj = (ClassName) $scalarOrObject; // Invokes
 ClassName::__cast($scalarOrObject);
 
 This would allow UString to implement casting a string to a UString and
allow
 users to implement such behavior with their own classes.
 
 However, I would not implement such casting syntax for UString only. Being
able
 to write $ustring = (UString) $string; without the ability to do so for
other classes
 would be unusual and confusing in my opinion. If an RFC adding such
behavior
 was implemented, UString could be updated to support casting.
 
 Obviously a UString should be able to be cast to a scalar string using
(string)
 $ustring. If performance is a concern, UString::__toString() should cache
the
 result so multiple casts to the same object are quick.
 
One way of doing this is already there, thanks to
https://wiki.php.net/rfc/operator_overloading_gmp . Consider

$n = gmp_init(42); var_dump($n, (int)$n);

However, the other way round could be done on a case-by-case basis, IMHO.
While it could make sense for class vs. scalar, casting class to class is
quite an unpredictable thing.

While users could implement it, how is it handled with arbitrary objects?
How would it map properties, would those classes need to implement the same
interface, et cetera? We're not in C at this point, where we would just
force a block of memory to be interpreted as we want.

Regards

Anatol






Re: [PHP-DEV] [RFC] UString

2015-07-01 Thread Aaron Piotrowski

 On Jul 1, 2015, at 2:25 PM, Anatol Belski anatol@belski.net wrote:
 
 Expanding on this idea, a separate RFC could propose a magic
 __cast($value)
 static method that would be called for code like below:
 
 $obj = (ClassName) $scalarOrObject; // Invokes
 ClassName::__cast($scalarOrObject);
 
 This would allow UString to implement casting a string to a UString and
 allow
 users to implement such behavior with their own classes.
 
 However, I would not implement such casting syntax for UString only. Being
 able
 to write $ustring = (UString) $string; without the ability to do so for
 other classes
 would be unusual and confusing in my opinion. If an RFC adding such
 behavior
 was implemented, UString could be updated to support casting.
 
 Obviously a UString should be able to be cast to a scalar string using
 (string)
 $ustring. If performance is a concern, UString::__toString() should cache
 the
 result so multiple casts to the same object are quick.


 Hi,
 
 One way of doing this is already there, thanks to
 https://wiki.php.net/rfc/operator_overloading_gmp . Consider
 
 $n = gmp_init(42); var_dump($n, (int)$n);
 
 However, the other way round could be done on a case-by-case basis, IMHO.
 While it could make sense for class vs. scalar, casting class to class is
 quite an unpredictable thing.
 
 While users could implement it, how is it handled with arbitrary objects?
 How would it map properties, would those classes need to implement the same
 interface, et cetera? We're not in C at this point, where we would just
 force a block of memory to be interpreted as we want.
 
 Regards
 
 Anatol

Hello,

I was thinking that the __cast() static method would examine the parameter 
given, then use that value to build a new object and return it or return null 
(which would then result in the engine throwing an Error saying that 
$scalarOrObject could not be cast to ClassName). It was just a suggestion to see 
what others thought because someone suggested supporting casting syntax such as 
$ustring = (UString) $scalarString. I don’t really care for either method 
though (__cast() or enabling casting just for UString), as they don't offer any 
advantage over writing new UString($string) or UString::fromString($string).

Aaron Piotrowski



Re: [PHP-DEV] [RFC] UString

2015-07-01 Thread Andreas Heigl
Hi Joe.

On 01.07.15 at 07:36, Joe Watkins wrote:
 [..]
 
 Another possible issue is engine integration:
 
 $string = (UString) $someString;
 $string = (UString) "someString";
 
 These aren't very different to 'new UString', but for an integrated
 solution, kind of expected to work.

Why would that be expected behaviour? I mean I can't do

$date = (DateTime) $timestring;

after all, can I? But I can use

$date = new DateTime($timestring);

Just my 2 cents.

Cheers

Andreas
-- 
  ,,,
 (o o)
+-ooO-(_)-Ooo-+
| Andreas Heigl   |
| mailto:andr...@heigl.org  N 50°22'59.5 E 08°23'58 |
| http://andreas.heigl.org   http://hei.gl/wiFKy7 |
+-+
| http://hei.gl/root-ca   |
+-+





Re: [PHP-DEV] [RFC] UString

2015-07-01 Thread Joe Watkins
Morning,

 Why would that be expected behaviour? I mean I can't do

$date = (DateTime) $timestring;

No, but you can't do:

 $string = (string) $datetime;

But can do:

$string = (string) $ustring;

Where $ustring is instanceof UString.

Even if you never write $string = (string) $ustring, the engine will
perform the same
action all the time, whenever you pass a UString to anything expecting
string.

It feels like a complete implementation should support both casts.
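
To make the point about implicit conversion concrete, here is a small sketch. Label is a made-up class used only for illustration, and the behaviour shown assumes the default (non-strict) typing mode:

```php
<?php
// Illustrative class; Label is a made-up name, not part of the UString RFC.
final class Label
{
    private $text;

    public function __construct($text)
    {
        $this->text = $text;
    }

    public function __toString()
    {
        return $this->text;
    }
}

$label = new Label('hello');

// Explicit cast: calls __toString().
var_dump((string) $label);     // string(5) "hello"

// Implicit conversion: the engine performs the same call when an object
// with __toString() is passed where a string is expected.
var_dump(strtoupper($label));  // string(5) "HELLO"

// DateTime defines no __toString(), which is why (string) $datetime fails.
var_dump(method_exists('DateTime', '__toString'));  // bool(false)
```

The reverse direction, from string to object, has no comparable hook today, which is the asymmetry being discussed.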

Cheers
Joe

On Wed, Jul 1, 2015 at 7:38 AM, Andreas Heigl andr...@heigl.org wrote:

 Hi Joe.

 On 01.07.15 at 07:36, Joe Watkins wrote:
  [..]
 
  Another possible issue is engine integration:
 
  $string = (UString) $someString;
  $string = (UString) "someString";
 
  These aren't very different to 'new UString', but for an integrated
  solution, kind of expected to work.

 Why would that be expected behaviour? I mean I can't do

 $date = (DateTime) $timestring;

 after all, can I? But I can use

 $date = new DateTime($timestring);

 Just my 2 cents.

 Cheers

 Andreas
 --
   ,,,
  (o o)
 +-ooO-(_)-Ooo-+
 | Andreas Heigl   |
 | mailto:andr...@heigl.org  N 50°22'59.5 E 08°23'58 |
 | http://andreas.heigl.org   http://hei.gl/wiFKy7 |
 +-+
 | http://hei.gl/root-ca   |
 +-+




Re: [PHP-DEV] [RFC] UString

2015-06-30 Thread Joe Watkins
Morning Sara,

 Curious what the current state of the UString RFC is.  I've got a
 functionality request for HHVM to wrap icu::UnicodeString and was
 hoping to match PHP behavior if any plans had been made, and lo...
 here's a plan!

I was (semi) convinced by Dmitry that the superior implementation is one
for Zend, so I backed off ...

 I think (hope) that Joe's intention was to make it as an extension for
 proof of concept, but make it part of the intl extension when it comes
 to full adoption by the runtime.  If not, let's talk about making that
 the intent, because intl is where this belongs.

The folder the source code is in makes no nevermind; the real issue with
integration is changing all of intl, and lots of other stuff, to accept
UString, since casting to a basic type, while acceptable for simple tests,
would get extremely wasteful for an application of any complexity.

Another possible issue is engine integration:

$string = (UString) $someString;
$string = (UString) "someString";

These aren't very different to 'new UString', but for an integrated
solution, kind of expected to work.

I don't know what the solutions are to these problems, I'm all ears ...

Cheers
Joe

On Wed, Jul 1, 2015 at 12:30 AM, Sara Golemon poll...@php.net wrote:

 On Mon, Mar 2, 2015 at 12:48 AM, Nikita Popov nikita@gmail.com
 wrote:
  On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org
 wrote:
  https://wiki.php.net/rfc/ustring
 
  This is the result of work done by a few of us, we won't be
  opening any
  vote in a fortnight. We have a long time before 7, there is no rush
  whatever.
 
  Now seems like a good time to start the conversation so we can
  hash out
  the details, or get on with other things ;)
 
 
 Curious what the current state of the UString RFC is.  I've got a
 functionality request for HHVM to wrap icu::UnicodeString and was
 hoping to match PHP behavior if any plans had been made, and lo...
 here's a plan!

  I'm not totally convinced by this proposal. We already have quite a
 number
  of extensions that deal with unicode text in one way or another (at least
  intl, mbstring and iconv). This adds yet another way of dealing with this
  issue - a way that will have to be combined with at least two other
  extensions (mbstring or iconv for input handling and conversion) and intl
  for any non-trivial operations. There's nothing wrong with adding another
  approach for unicode handling per se, but I'd like to have more emphasis
 on
  how this integrates with existing functionality and why it is implemented
  separately from it (especially intl), etc.
 
 I think (hope) that Joe's intention was to make it as an extension for
 proof of concept, but make it part of the intl extension when it comes
 to full adoption by the runtime.  If not, let's talk about making that
 the intent, because intl is where this belongs.

 For my bikeshedding part, I'd recommend against the u() function
 helper as it pollutes the global function namespace and takes a very
 fundamental name.  intl\u() might be worth considering, but we'll need
 to have a discussion about namespacing for the intl extension as a
 whole (separate topic).

 I'd also recommend IntlString rather than UString as nearly all
 the Intl classes follow this convention.  The one notable exception
 being UConverter (which yes, I added, and I regret the departure in
 naming).

 Otherwise, while there's room to quibble about specific API names and
 arguments, the general concept seems a no-brainer.

 -Sara



Re: [PHP-DEV] [RFC] UString

2015-06-30 Thread Sara Golemon
On Mon, Mar 2, 2015 at 12:48 AM, Nikita Popov nikita@gmail.com wrote:
 On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org wrote:
 https://wiki.php.net/rfc/ustring

 This is the result of work done by a few of us, we won't be
 opening any
 vote in a fortnight. We have a long time before 7, there is no rush
 whatever.

 Now seems like a good time to start the conversation so we can
 hash out
 the details, or get on with other things ;)


Curious what the current state of the UString RFC is.  I've got a
functionality request for HHVM to wrap icu::UnicodeString and was
hoping to match PHP behavior if any plans had been made, and lo...
here's a plan!

 I'm not totally convinced by this proposal. We already have quite a number
 of extensions that deal with unicode text in one way or another (at least
 intl, mbstring and iconv). This adds yet another way of dealing with this
 issue - a way that will have to be combined with at least two other
 extensions (mbstring or iconv for input handling and conversion) and intl
 for any non-trivial operations. There's nothing wrong with adding another
 approach for unicode handling per se, but I'd like to have more emphasis on
 how this integrates with existing functionality and why it is implemented
 separately from it (especially intl), etc.

I think (hope) that Joe's intention was to make it as an extension for
proof of concept, but make it part of the intl extension when it comes
to full adoption by the runtime.  If not, let's talk about making that
the intent, because intl is where this belongs.

For my bikeshedding part, I'd recommend against the u() function
helper as it pollutes the global function namespace and takes a very
fundamental name.  intl\u() might be worth considering, but we'll need
to have a discussion about namespacing for the intl extension as a
whole (separate topic).
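
As a sketch of what a namespaced helper could look like in userland (the Example\Intl namespace, the Text class, and this u() are all made up for illustration; the intl extension defines none of them):

```php
<?php
// Hypothetical sketch: Example\Intl, Text and u() are illustrative names.
namespace Example\Intl;

final class Text
{
    public $value;

    public function __construct($value)
    {
        $this->value = $value;
    }
}

// The short helper lives inside the namespace instead of the global scope.
function u($value)
{
    return new Text($value);
}

$text = u('abc');
var_dump(\get_class($text));                     // string(17) "Example\Intl\Text"
var_dump(\function_exists('u'));                 // bool(false): no global u() was defined
var_dump(\function_exists('Example\Intl\u'));    // bool(true)
```

Callers in other namespaces would need the fully qualified name or a `use function` import, which is the trade-off against a global u().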

I'd also recommend IntlString rather than UString as nearly all
the Intl classes follow this convention.  The one notable exception
being UConverter (which yes, I added, and I regret the departure in
naming).

Otherwise, while there's room to quibble about specific API names and
arguments, the general concept seems a no-brainer.

-Sara




Re: [PHP-DEV] [RFC] UString

2015-03-02 Thread Nikita Popov
On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org wrote:

 Morning internalz,

 https://wiki.php.net/rfc/ustring

 This is the result of work done by a few of us, we won't be
 opening any
 vote in a fortnight. We have a long time before 7, there is no rush
 whatever.

 Now seems like a good time to start the conversation so we can
 hash out
 the details, or get on with other things ;)


I'm not totally convinced by this proposal. We already have quite a number
of extensions that deal with unicode text in one way or another (at least
intl, mbstring and iconv). This adds yet another way of dealing with this
issue - a way that will have to be combined with at least two other
extensions (mbstring or iconv for input handling and conversion) and intl
for any non-trivial operations. There's nothing wrong with adding another
approach for unicode handling per se, but I'd like to have more emphasis on
how this integrates with existing functionality and why it is implemented
separately from it (especially intl), etc.
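
For context, a small sketch of the overlap between the existing options. It only uses functions that ship with PHP or with the mbstring, iconv and intl extensions, and skips any extension that is not loaded. The sample string and the $counts array are just for illustration:

```php
<?php
// "naive" with a diaeresis on the i, written as explicit UTF-8 bytes.
$text = "na\xC3\xAFve";

$counts = array();

// Core: counts bytes, not characters.
$counts['strlen'] = strlen($text);  // 6

// mbstring: counts code points in the given encoding.
if (function_exists('mb_strlen')) {
    $counts['mb_strlen'] = mb_strlen($text, 'UTF-8');
}

// iconv: also counts characters in the given encoding.
if (function_exists('iconv_strlen')) {
    $counts['iconv_strlen'] = iconv_strlen($text, 'UTF-8');
}

// intl: counts grapheme clusters.
if (function_exists('grapheme_strlen')) {
    $counts['grapheme_strlen'] = grapheme_strlen($text);
}

print_r($counts);
```

Three extensions answer nearly the same question with three different function families, which is the integration concern raised above.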

On a more general note, I'd appreciate it if RFCs proposing the inclusion
of extensions moved more of their content into the actual RFC, as opposed
to being thin wrappers around the extension README/docs. We had this issue
with the pecl_http RFC and the same applies here. I think the suggested API
is a pretty important aspect of the proposal and as such should be included
in the RFC and maybe also commented a bit ;)

Nikita


Re: [PHP-DEV] [RFC] UString

2015-03-02 Thread Pierre Joye
On Mon, Mar 2, 2015 at 12:48 AM, Nikita Popov nikita@gmail.com wrote:
 On Tue, Oct 21, 2014 at 9:06 AM, Joe Watkins pthre...@pthreads.org wrote:

 Morning internalz,

 https://wiki.php.net/rfc/ustring

 This is the result of work done by a few of us, we won't be
 opening any
 vote in a fortnight. We have a long time before 7, there is no rush
 whatever.

 Now seems like a good time to start the conversation so we can
 hash out
 the details, or get on with other things ;)


 I'm not totally convinced by this proposal. We already have quite a number
 of extensions that deal with unicode text in one way or another (at least
 intl, mbstring and iconv). This adds yet another way of dealing with this
 issue - a way that will have to be combined with at least two other
 extensions (mbstring or iconv for input handling and conversion) and intl
 for any non-trivial operations. There's nothing wrong with adding another
 approach for unicode handling per se, but I'd like to have more emphasis on
 how this integrates with existing functionality and why it is implemented
 separately from it (especially intl), etc.

 On a more general note, I'd appreciate it if RFCs proposing the inclusion
 of extensions moved more of their content into the actual RFC, as opposed
 to being thin wrappers around the extension README/docs. We had this issue
 with the pecl_http RFC and the same applies here. I think the suggested API
 is a pretty important aspect of the proposal and as such should be included
 in the RFC and maybe also commented a bit ;)

Full ack. Both paragraphs.

As of now, and based on the previous discussions, which pointed out the same
issues (minus the RFC one, but this is a detail, important, but a
detail), I am also not convinced this is the way to tackle Unicode
text support. It should either be part of intl (with a separate RFC proposing
to always enable intl for 7) or main. Main has the advantage of
providing easier integration with other extensions.

Cheers,
-- 
Pierre

@pierrejoye | http://www.libgd.org




Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Yasuo Ohgaki
Hi Lester,

On Mon, Mar 2, 2015 at 5:34 AM, Lester Caine les...@lsces.co.uk wrote:

 On 28/02/15 06:48, Joe Watkins wrote:
  This is just a quick note to announce my intention to ready this RFC
  for voting next week.

 Since there is nothing in this which needs any changes to the core then
 surly it simply needs to exist in pecl until such time as a proper
 replacement for unicode in core strings has been addressed? Since it
 will still require intl to provide those areas it does not support, and
 I question if we really need to provide yet another encoding converter.

 A unicode string handler that just handles UTF8 strings may be yet
 another stepping stone, but it still falls short of being able to
 handle all of the internationalization problems and is simply an
 alternate to mbstring so one either runs both, or sit down and convert
 all the third party libraries to eliminate mbstring.

 Like http extension, it's not essential that it's loaded by default, and
 leaving it in pecl allows development outside that of the core?


Although it seems the current code does not have operator overloading like
GMP, I'm sure we'll have this before release, i.e.

$new = $some_ustring . 'abc'; // $new is UString object

To implement a feature like this, it cannot be PECL.

My only concern with this RFC is performance. Since it is only loosely
integrated into PHP core, efficiency may suffer. I suppose other people are
working on a simpler and tighter integration into core. Any comments on this?

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net


Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Rowan Collins

On 01/03/2015 21:26, Yasuo Ohgaki wrote:

Although the current code does not seem to have operator overloading like
GMP's, I'm sure we'll have it before release, e.g.

$new = $some_ustring . 'abc'; // $new is UString object

To implement a feature like this, it cannot be in PECL.


Why not? I would have thought any extension can hook into the operator 
overloading API that GMP uses, just as they can hook into other object 
behaviours.


Is there some difference between how bundled and PECL extensions are 
loaded that would prevent this?


Regards,

--
Rowan Collins
[IMSoP]





Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Yasuo Ohgaki
Hi Rowan,

On Mon, Mar 2, 2015 at 6:32 AM, Rowan Collins rowan.coll...@gmail.com
wrote:

 On 01/03/2015 21:26, Yasuo Ohgaki wrote:

 Although the current code does not seem to have operator overloading like
 GMP's, I'm sure we'll have it before release, e.g.

 $new = $some_ustring . 'abc'; // $new is UString object

 To implement a feature like this, it cannot be in PECL.


 Why not? I would have thought any extension can hook into the operator
 overloading API that GMP uses, just as they can hook into other object
 behaviours.

 Is there some difference between how bundled and PECL extensions are
 loaded that would prevent this?


OK. I missed that the GMP improvement included generic operator overloading.
If the current implementation is good enough for UString, it could be in PECL.
Otherwise, we could add the missing parts to core so that UString can live in
PECL.

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net


Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Yasuo Ohgaki
Hi Joe and Rowan,

On Mon, Mar 2, 2015 at 6:37 AM, Rowan Collins rowan.coll...@gmail.com
wrote:

 On 01/03/2015 20:34, Lester Caine wrote:

 On 28/02/15 06:48, Joe Watkins wrote:

  This is just a quick note to announce my intention to ready this RFC
 for voting next week.

 Since there is nothing in this which needs any changes to the core then
 surely it simply needs to exist in pecl until such time as a proper
 replacement for unicode in core strings has been addressed? It
 will still require intl to provide those areas it does not support, and
 I question if we really need to provide yet another encoding converter.

 A unicode string handler that just handles UTF8 strings may be yet
 another stepping stone, but it still falls short of being able to
 handle all of the internationalization problems and is simply an
 alternative to mbstring, so one either runs both, or sits down and converts
 all the third party libraries to eliminate mbstring.

 Like http extension, it's not essential that it's loaded by default, and
 leaving it in pecl allows development outside that of the core?


 I think this is probably a good idea at this stage. It will give people a
 chance to play around with it in an experimental state before committing
 to maintaining a particular API.

 Since there's no real BC break here, there's no reason it couldn't be
 bundled into 7.1 if it was deemed ready by then, so it seems unwise to rush
 into including it in 7.0 straight from what feels like a prototype
 implementation.


Sounds reasonable.

Joe, I don't have much time, but I'm willing to help with UString
development.
I think it's better to keep it simple. A unified internal encoding
(NFC-normalized UTF-8 without a BOM) for the internal string representation
would be much simpler than multiple encodings.

We may consider various issues/ideas like these in the relatively long term.
http://websec.github.io/unicode-security-guide/character-transformations/
http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net


Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Yasuo Ohgaki
Hi Florian,

On Mon, Mar 2, 2015 at 5:57 AM, Florian Margaine flor...@margaine.com
wrote:

 On 1 March 2015 21:26, Derick Rethans der...@php.net wrote:
 
  Hey Joe,
 
  I think there are a few issues with the proposal, although I like the
  general idea. I've had the tab with the RFC open since October... but
  never looked at it until now :-/. So, a few comments:
 
  - UString as a name.
 
  I think I am going to prefer Text as a class name. Unicode (and
  intl/icu) have lots of operators acting on items containing unicode
  strings. But they are really pieces of text. For example sentences, word
  break iterators, etc. UString *feels* clunky, and not standard. If
  it's going to be part of PHP core, then we should pick a core name. (I
  might prefer String, but that's going to cause a whole lot of issues
  obviously).

 Isn't this solved if we use \php\String?


I suppose we need the Context Sensitive Lexer for String, but I guess it
will pass.

Let's at least use a namespace for new internal classes.

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net


Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Rowan Collins

On 01/03/2015 20:59, Yasuo Ohgaki wrote:

However, I don't mind too much allowing any encoding to be stored in a Text/
UString object. IIRC, Ruby does this and does not have many problems.


As I understand it, Ruby's string type is actually a whole bunch of 
overloaded types, each responsible for re-implementing the various 
methods available. This leads to a whole bunch of partially supported 
encodings/codepages, which is a big pile of leaky abstraction for the 
small benefit of removing re-encoding operations in a few scenarios.


Unicode is explicitly designed to supersede all previous encodings, so 
it makes perfect sense to me to use it to internally represent what 
the user just wants to think of as text. The fact that within that 
internal representation you need some byte-level encoding then leads to 
the optimisation of using a byte-level encoding the user is likely to 
use as input and output, i.e. UTF-8.


Regards,

--
Rowan Collins
[IMSoP]





Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Yasuo Ohgaki
Hi Joe,

On Sat, Feb 28, 2015 at 3:48 PM, Joe Watkins pthre...@pthreads.org wrote:

 This is just a quick note to announce my intention to ready this RFC
 for voting next week.

 I know I'm a little late maybe, I was real sick most of last week, so
 couldn't do anything useful.

 A couple of us intend to fix outstanding issues on github and those
 raised here, tidy the RFC and open the vote for 7.

I would ask anyone interested to scan through this thread and announce
 concerns that are not mentioned asap.


I appreciate your proposal!
Rowan pointed out some important things. I don't understand the details yet,
as I haven't read your code. I'll try to read it and comment in a few days.

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net


Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Yasuo Ohgaki
Hi Joe,

On Sun, Mar 1, 2015 at 7:14 PM, Yasuo Ohgaki yohg...@ohgaki.net wrote:

public function __construct([string $string [, string $source_codepage
 [, string $substitute_char]]]);


One additional comment on the constructor: it should apply a default
normalization. I think it should be NFC, as most systems use it. (OS X uses
NFD for filenames! I hate it, and most Japanese developers hate it.)

The API may be

public function __construct([string $string [, string $source_codepage [,
string $substitute_char [, $normalization]]]]);

If $substitute_char is NULL, disallow invalid encoding.

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net


Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Yasuo Ohgaki
Hi Joe,

On Sun, Mar 1, 2015 at 6:14 PM, Yasuo Ohgaki yohg...@ohgaki.net wrote:

 On Sat, Feb 28, 2015 at 3:48 PM, Joe Watkins pthre...@pthreads.org
 wrote:

 This is just a quick note to announce my intention to ready this RFC
 for voting next week.

 I know I'm a little late maybe, I was real sick most of last week, so
 couldn't do anything useful.

 A couple of us intend to fix outstanding issues on github and those
 raised here, tidy the RFC and open the vote for 7.

I would ask anyone interested to scan through this thread and announce
 concerns that are not mentioned asap.


 I appreciate your proposal!
 Rowan pointed out some important things. I don't understand the details yet,
 as I haven't read your code. I'll try to read it and comment in a few days.


I guess you would like to start voting today or tomorrow, so I briefly read
your code.
I think your approach is good. I like that UString is always UTF-8 by
default, regardless of other settings such as default_charset and
internal_encoding.

I see a few missing key APIs that would be critical for multibyte character
handling, such as string length, string width, normalization, string
conversions like Zenkaku to Hankaku, and an encoding (codepage) converter.
However, all of these may be added later, as they are already implemented
in ICU.

I think it may be better for UString to always use UTF-8, to make users'
lives a little simpler. Your constructor only has a codepage setting, which
is used as the UString's codepage in order to support other codepages
(encodings).

Rather than supporting various encodings, I think the constructor needs an
encoding (codepage) conversion feature. The codepage parameter would be
better used as a source encoding (codepage) parameter, so that any encoding
(codepage) is converted to UTF-8. If the conversion fails, it should raise an
exception. A forgiving API for malformed strings is better offered only when
the user explicitly asks for it.

Constructor may be

   public function __construct([string $string [, string $source_codepage
[, string $substitute_char]]]);

$source_codepage is the source string's encoding (codepage), and $string is
always converted to UTF-8.
If $substitute_char is omitted, raise an exception for an invalid $string.
If $substitute_char is specified (it can be '', the empty string), convert
$string according to $source_codepage and just remove/replace invalid byte
sequences in $string.

With this constructor, the string stored in a UString object is always valid
UTF-8. Any character encoding (including UTF-16/32 and the 200 encoding names
supported by ICU) may be used for the source string.
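
As a rough sketch of these semantics in current PHP (this is not the UString
patch; toUtf8() is a hypothetical helper, and the mbstring extension is
assumed to be available):

```php
<?php
// Hypothetical helper illustrating the proposed constructor semantics.
// Assumption: ext/mbstring is loaded. This is not code from the UString patch.
function toUtf8(string $string, string $sourceEncoding = 'UTF-8', ?string $substituteChar = null): string
{
    if ($substituteChar === null) {
        // Strict mode: refuse input that is malformed in the source encoding.
        if (!mb_check_encoding($string, $sourceEncoding)) {
            throw new InvalidArgumentException("Invalid $sourceEncoding input");
        }
        return mb_convert_encoding($string, 'UTF-8', $sourceEncoding);
    }

    // Forgiving mode: invalid byte sequences are replaced by $substituteChar,
    // or dropped when it is the empty string.
    $previous = mb_substitute_character();
    mb_substitute_character($substituteChar === '' ? 'none' : mb_ord($substituteChar, 'UTF-8'));
    try {
        return mb_convert_encoding($string, 'UTF-8', $sourceEncoding);
    } finally {
        // mbstring keeps the substitute character as global state; restore it.
        mb_substitute_character($previous);
    }
}

// 0xC5 is "Å" in ISO-8859-1; in UTF-8 it becomes the two bytes C3 85.
echo bin2hex(toUtf8("\xC5", 'ISO-8859-1')), "\n"; // c385
```

The result is valid UTF-8 in both modes; only the forgiving mode can lose
data.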

Since there will be no variable codepage setting for UString objects,
the following may be removed.

public static function getDefaultCodepage();
public static function setDefaultCodepage(string $codepage);

ICU uses "codepage" to mean character encoding, but it may be better to say
"character encoding", as people are not used to ICU terminology.

This is what I thought. I didn't read your code carefully, so I might be
wrong. Please correct me if I'm mistaken.

I suppose there are other people working on simpler Unicode-string-based
libraries. I would like to hear opinions from them.

BTW, we really need byte_len(). strlen() is just a confusing API... It's
outside the scope of this RFC, though.
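
To illustrate the confusion, a small sketch using only core functions
(byte_len() itself does not exist; the variable names are illustrative):

```php
<?php
// "Å" (U+00C5) followed by "s", written as raw UTF-8 bytes: C3 85 73.
$text = "\xC3\x85s";

// strlen() counts bytes, not characters.
$byteLength = strlen($text);

// PCRE in UTF-8 mode (/u) matches one code point per "."
$codePoints = preg_match_all('/./us', $text);

echo $byteLength, "\n"; // 3
echo $codePoints, "\n"; // 2
```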

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net


Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Rowan Collins

On 01/03/2015 20:34, Lester Caine wrote:

On 28/02/15 06:48, Joe Watkins wrote:

 This is just a quick note to announce my intention to ready this RFC
for voting next week.

Since there is nothing in this which needs any changes to the core then
surely it simply needs to exist in pecl until such time as a proper
replacement for unicode in core strings has been addressed? It
will still require intl to provide those areas it does not support, and
I question if we really need to provide yet another encoding converter.

A unicode string handler that just handles UTF8 strings may be yet
another stepping stone, but it still falls short of being able to
handle all of the internationalization problems and is simply an
alternative to mbstring, so one either runs both, or sits down and converts
all the third party libraries to eliminate mbstring.

Like http extension, it's not essential that it's loaded by default, and
leaving it in pecl allows development outside that of the core?



I think this is probably a good idea at this stage. It will give people 
a chance to play around with it in an experimental state before 
committing to maintaining a particular API.


Since there's no real BC break here, there's no reason it couldn't be 
bundled into 7.1 if it was deemed ready by then, so it seems unwise to 
rush into including it in 7.0 straight from what feels like a prototype 
implementation.


Regards,

--
Rowan Collins
[IMSoP]





Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Yasuo Ohgaki
Hi Joe and Rowan,

On Mon, Mar 2, 2015 at 7:14 AM, Yasuo Ohgaki yohg...@ohgaki.net wrote:

 Hi Joe and Rowan,

 On Mon, Mar 2, 2015 at 6:37 AM, Rowan Collins rowan.coll...@gmail.com
 wrote:

 On 01/03/2015 20:34, Lester Caine wrote:

 On 28/02/15 06:48, Joe Watkins wrote:

  This is just a quick note to announce my intention to ready this
 RFC
 for voting next week.

 Since there is nothing in this which needs any changes to the core then
 surely it simply needs to exist in pecl until such time as a proper
 replacement for unicode in core strings has been addressed? It
 will still require intl to provide those areas it does not support, and
 I question if we really need to provide yet another encoding converter.

 A unicode string handler that just handles UTF8 strings may be yet
 another stepping stone, but it still falls short of being able to
 handle all of the internationalization problems and is simply an
 alternative to mbstring, so one either runs both, or sits down and converts
 all the third party libraries to eliminate mbstring.

 Like http extension, it's not essential that it's loaded by default, and
 leaving it in pecl allows development outside that of the core?


 I think this is probably a good idea at this stage. It will give people a
 chance to play around with it in an experimental state before committing
 to maintaining a particular API.

 Since there's no real BC break here, there's no reason it couldn't be
 bundled into 7.1 if it was deemed ready by then, so it seems unwise to rush
 into including it in 7.0 straight from what feels like a prototype
 implementation.


 Sounds reasonable.

 Joe, I don't have much time, but I'm willing to help with UString
 development.
 I think it's better to keep it simple. A unified internal encoding
 (NFC-normalized UTF-8 without a BOM) for the internal string representation
 would be much simpler than multiple encodings.

 We may consider various issues/ideas like these in the relatively long term.
 http://websec.github.io/unicode-security-guide/character-transformations/
 http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html


We used to have EXPERIMENTAL modules.
How about shipping this as an EXPERIMENTAL module in the source distribution?
It would get more attention, and development would be faster.

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net


Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Derick Rethans
Hey Joe,

I think there are a few issues with the proposal, although I like the 
general idea. I've had the tab with the RFC open since October... but 
never looked at it until now :-/. So, a few comments:

- UString as a name.

I think I am going to prefer Text as a class name. Unicode (and 
intl/icu) have lots of operators acting on items containing unicode 
strings. But they are really pieces of text. For example sentences, word 
break iterators, etc. UString *feels* clunky, and not standard. If 
it's going to be part of PHP core, then we should pick a core name. (I 
might prefer String, but that's going to cause a whole lot of issues 
obviously).

- Needs More Methods

I had a look at the API that it links to, and I miss operators like 
iterators: over words, sentences, characters, etc. Basically the 
functionality of  
http://docs.php.net/manual/en/class.intlbreakiterator.php, 
http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and 
http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php

I realize intl already implements this, but it's really beneficial to 
have for a Text class - especially for replacing functionality where 
people now loop over a string with a character index. 

- Not a full String API Replacement

I would certainly expect more from it than just the UnicodeString API. 
Perhaps not for a first iteration, but certainly for subsequent 
versions. Things like transliterations, and specifically iterators would 
be high on my list.

- Patch

There is toUpper/toLower, but one for toTitle is missing.

- In the code's README:

Note: UString is interchangable with zend strings for method parameters 
and can be cast for output/conversion to zend strings

How does that work? And what would it convert to?

- How are characters counted?

Is a character a Code Point, or is a character a base character + 
combining diacritics? In the first form, A + ° is considered as two 
characters; in the second, just one. For wordwrap, splice, 
substring, it is really important that only the *full sequence* is 
considered as a character. And hence, a character really should be the 
full sequence. The text in charAt seems to contradict that, and that 
is a mistake.

In the original PHP 6 we didn't do that for performance reasons, but 
that point is moot now as only people who opt into using Text will 
suffer from this.

- trim

What is a leading or trailing space? Is it just U+0020, or other 
Unicode-defined space characters as well? (&nbsp;, U+00A0, comes to mind here)

- What is UG(defaultpad) about?

- For the code:

  - there is some interesting, non-standard whitespacing going on:

- { goes on the next line after a func decl
- sometimes 4 spaces instead of a tab are used for indentation

- Why is there no __toString()?

- How can other extensions, not really making use of Text, use their 
  strings (as UTF-8 strings, for example)?
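
To make the character-counting and trim questions concrete, here is a small
sketch using only core PCRE (it is not the proposed Text/UString API;
unicodeTrim() is a hypothetical helper):

```php
<?php
// "Å" as base letter + combining ring above (U+0041 U+030A), followed by "s".
$decomposed = "A\xCC\x8As";

// Code points: the combining mark is counted on its own.
$codePoints = preg_match_all('/./us', $decomposed);

// Grapheme clusters: \X matches a base character together with its marks.
$graphemes = preg_match_all('/\X/u', $decomposed);

echo $codePoints, "\n"; // 3
echo $graphemes, "\n";  // 2

// Hypothetical trim that strips any Unicode separator (\p{Z}) as well as
// ASCII whitespace, instead of only the bytes that trim() knows about.
function unicodeTrim(string $s): string
{
    return preg_replace('/^[\p{Z}\s]+|[\p{Z}\s]+$/u', '', $s) ?? $s;
}

$padded = "\xC2\xA0 hello \xC2\xA0"; // U+00A0 (no-break space) at both ends
var_dump(unicodeTrim($padded) === 'hello'); // bool(true)
var_dump(trim($padded) === 'hello');        // bool(false)
```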


cheers,
Derick


On Sat, 28 Feb 2015, Joe Watkins wrote:

 Morning internals,
 
 This is just a quick note to announce my intention to ready this RFC
 for voting next week.
 
 I know I'm a little late maybe, I was real sick most of last week, so
 couldn't do anything useful.
 
 A couple of us intend to fix outstanding issues on github and those
 raised here, tidy the RFC and open the vote for 7.
 
I would ask anyone interested to scan through this thread and announce
 concerns that are not mentioned asap.
 
 Cheers
 Joe
 
 On Fri, Oct 24, 2014 at 3:01 PM, Chris Wright daveran...@php.net wrote:
 
  On 24 October 2014 07:03, Joe Watkins pthre...@pthreads.org wrote:
 
  On Thu, 2014-10-23 at 12:54 -0700, Stas Malyshev wrote:
   Hi!
  
P.S. u() is a bad name, will break lots of code, i.e.
  
   Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's
  safe.
  
 
  /me cringes ...
 
  I wonder how much of a problem it really is, usually when we say some
  function name is a problem is because of hundreds and hundreds of
  results on github.
 
  If it's a huge problem then we should rename it, if we have to dig
  around for a single project that's incompatible, or even a handful, then
  it's not really a problem.
 
  Cheers
  Joe
 
 
  I can see this being something relatively common. While I personally would
  never do it, there are a few reasons I can think of that people *might* do
  it:
 
  - Wrapper for creating <u> HTML output
  - urlencode() shortcut
  - (obviously) various unicode-related things
 
  Searching on codesearch [1] revealed (amongst a few other hits on the
  first page) another interesting use of it in the hhvm test suite [2]. It's
  difficult to search for this because all the available public search
  engines that I know of do fuzzy matching.
 
  Sorry. This sucks, because every other option we have for this sucks.
 
  On the bright side, anything chosen could always be aliased at the top of
  the file:
 
  use function __u as u;
 
  This also sucks, but it sucks a little bit less because the 

Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Derick Rethans
On Sun, 1 Mar 2015, Yasuo Ohgaki wrote:

 Hi Joe,
 
 On Sun, Mar 1, 2015 at 7:14 PM, Yasuo Ohgaki yohg...@ohgaki.net wrote:
 
 public function __construct([string $string [, string $source_codepage
  [, string $substitute_char]]]);
 
 One additional comment for constructor. It should have default 
 normalization. I think it should be NFC as most system uses it. (OSX 
 uses NFD for filenames! I hate it and most of Japanese developers hate 
 it)
 
 The API may be
 
 public function __construct([string $string [, string $source_codepage [,
 string $substitute_char [, $normalization]]]]);

I wouldn't make normalization a constructor option, and it certainly 
shouldn't be done by default. I would suggest separate (mutating) methods 
to convert between normalisation forms.

 If $substitute_char is NULL, disallow invalid encoding.

I don't think substitutions (i.e., data loss) should be allowed at all. This 
should throw an immediate exception. If you really want this, I suggest 
adding a factory method for it, e.g. Text::createWithSubstitutions - 
or whatever better name.

cheers,
Derick




Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Lester Caine
On 28/02/15 06:48, Joe Watkins wrote:
 This is just a quick note to announce my intention to ready this RFC
 for voting next week.

Since there is nothing in this which needs any changes to the core then
surely it simply needs to exist in pecl until such time as a proper
replacement for unicode in core strings has been addressed? It
will still require intl to provide those areas it does not support, and
I question if we really need to provide yet another encoding converter.

A unicode string handler that just handles UTF8 strings may be yet
another stepping stone, but it still falls short of being able to
handle all of the internationalization problems and is simply an
alternative to mbstring, so one either runs both, or sits down and converts
all the third party libraries to eliminate mbstring.

Like http extension, it's not essential that it's loaded by default, and
leaving it in pecl allows development outside that of the core?

-- 
Lester Caine - G8HFL
-
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk




Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Derick Rethans
On Sat, 28 Feb 2015, Rowan Collins wrote:

 On 28/02/2015 06:48, Joe Watkins wrote:
  Morning internals,
  
   This is just a quick note to announce my intention to ready this RFC
  for voting next week.
  
   I know I'm a little late maybe, I was real sick most of last week, so
  couldn't do anything useful.
  
   A couple of us intend to fix outstanding issues on github and those
  raised here, tidy the RFC and open the vote for 7.
  
  I would ask anyone interested to scan through this thread and announce
  concerns that are not mentioned asap.
 
 I still think this class is trying to do several jobs, and not doing any of
 them very well, and I fear that people will see this class and expect it to
 solve problems which it actually ignores.
 
 Here are some concrete use cases I would like a simple interface to solve for
 me:
 
 - Take text from an ISO 8859-2 data source, pass it through generic text
 filters, and pass it to a UTF-16 data target.
 - Given a long string of Unicode text, give me a valid UTF-8 string which fits
 into a buffer with fixed byte size; i.e. give me the largest number of whole
 code points which fit into that number of bytes once encoded.
 - As above, but without stripping diacritics off the last character of the
 resulting string, i.e. give me the largest number of whole graphemes which
 fit.
 - Split a string into equal sized chunks of readable characters (graphemes),
 regardless of how many bytes or code points each chunk contains.
 
 UString currently falls short of all of these:
 
 - I can specify my input encoding (in the constructor or helper method,
 over-riding a static default, which is equivalent to ext/mbstring's global
 setting), but not my output encoding (there is no method to ask for a byte
 representation other than a string cast, which by definition has no
 parameters).

Yeah, there should be an output method to convert to a target encoding.

 - I can ask for a fixed number of code points, but don't know how many bytes
 these will take until I cast to a UTF-8 string.

As I said before, indexes into strings should not be based on code 
points, as the following would then break the characters:

$s = new Text("Ås"); // decomposed: A (U+0041) + combining ring (U+030A) + s
echo $s->substring(1);

The output would be a stray combining ring (U+030A) followed by s.

Whereas:

$s = new Text("Ås"); // precomposed: Å (U+00C5) + s
echo $s->substring(1);

would output s.

Which is not what people would expect.

 - I can't manipulate anything at the grapheme level at all, even though this
 is the most meaningful level of operation in most cases.

Yes - graphemes should be the base blocks, not code points.
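
As a sketch of the "largest number of whole graphemes that fit in N bytes"
use case quoted above, using only core PCRE (fitGraphemes() is a hypothetical
helper, not part of the proposed API):

```php
<?php
// Hypothetical helper: longest prefix of whole grapheme clusters whose UTF-8
// encoding fits into $maxBytes.
function fitGraphemes(string $utf8, int $maxBytes): string
{
    // \X matches one extended grapheme cluster in PCRE's UTF-8 mode.
    if (preg_match_all('/\X/u', $utf8, $matches) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    $result = '';
    foreach ($matches[0] as $cluster) {
        if (strlen($result) + strlen($cluster) > $maxBytes) {
            break; // the next cluster would overflow the buffer
        }
        $result .= $cluster;
    }
    return $result;
}

// "s" + A + combining ring (U+030A): 4 bytes, 3 code points, 2 graphemes.
$text = "sA\xCC\x8A";

echo fitGraphemes($text, 3), "\n";          // s
echo bin2hex(fitGraphemes($text, 4)), "\n"; // 7341cc8a
```

Cutting the same input at a code point boundary with a 3-byte budget would
keep "sA" and silently drop the ring, which is exactly the diacritic
stripping described in the quoted use case.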

 Things it does do:
 
 - a handful of methods give meaningful international text support: toUpper(),
 toLower(),  trim()
 - some methods could be done on byte strings if I ensure they're all in UTF-8:
 replace(), contains(), startsWith(), endsWith(), repeat()

That doesn't always work when you have graphemes, or text in different 
normalisation forms. I.e., it should consider Å (U+00C5) and Å (U+0041 + 
U+030A) the same for contains and startsWith; that is, it should handle 
normalisation for comparison.
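
A minimal sketch of such a comparison with today's tools, assuming the intl
extension's Normalizer class is available (startsWithNormalized() is a
hypothetical helper, not the proposed Text method):

```php
<?php
// Hypothetical helper: compare after bringing both sides to NFC.
// Assumption: ext/intl is loaded, which provides the Normalizer class.
function startsWithNormalized(string $haystack, string $prefix): bool
{
    $h = Normalizer::normalize($haystack, Normalizer::FORM_C);
    $p = Normalizer::normalize($prefix, Normalizer::FORM_C);
    return $p === '' || strncmp($h, $p, strlen($p)) === 0;
}

$precomposed = "\xC3\x85s";  // Å (U+00C5) + s
$decomposed  = "A\xCC\x8As"; // A (U+0041) + combining ring (U+030A) + s

if (class_exists('Normalizer')) {
    var_dump($precomposed === $decomposed);                  // bool(false): the bytes differ
    var_dump(startsWithNormalized($decomposed, "\xC3\x85")); // bool(true)
}
```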

 - there may be limited situations where I want to dive into the code points
 which make up a string, although I can't think of many: $length, pad(),
 indexOf(), lastIndexOf(), charAt(), replaceSlice()

Break iterators on either code points, or graphemes, might work here?

 - remaining methods avoid me creating invalid UTF-8, but don't help me 
 much with real-life text: chunk(), split(), substring()
 - I can ask what codepage my Unicode string is in; I don't even understand 
 what this means
 
 I think an efficient OO wrapper around ICU is a great idea, but more 
 thought needs to go into what methods are exposed, and how people are 
 going to use them in real code.

Yes - I agree. I think this current proposal is a good start, but it 
needs to be worked out a little bit more before I think we should vote 
on it, however much I would like to see something like this in PHP.

cheers,
Derick

-- 
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug
Posted with an email client that doesn't mangle email: alpine

Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Florian Margaine
Hi,

On 1 March 2015 21:26, Derick Rethans der...@php.net wrote:

 Hey Joe,

 I think there are a few issues with the proposal, although I like the
 general idea. I've had the tab with the RFC open since October... but
 never looked at it until now :-/. So, a few comments:

 - UString as a name.

 I think I am going to prefer Text as a class name. Unicode (and
 intl/icu) have lots of operators acting on items containing unicode
 strings. But they are really pieces of text. For example sentences, word
 break iterators, etc. UString *feels* clunky, and not standard. If
 it's going to be part of PHP core, then we should pick a core name. (I
 might prefer String, but that's going to cause a whole lot of issues
 obviously).

Isn't this solved if we use \php\String?


 - Needs More Methods

 I had a look at the API that it links to, and I miss operators like
 iterators: over words, sentences, characters, etc. Basically the
 functionality of
 http://docs.php.net/manual/en/class.intlbreakiterator.php,
 http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and
 http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php

 I realize intl already implements this, but it's really beneficial to
 have for a Text class - especially for replacing functionality where
 people now loop over a string with a character index.

 - Not a full String API Replacement

 I would certainly expect more from it than just the UnicodeString API.
 Perhaps not for a first iteration, but certainly for subsequent
 versions. Things like transliterations, and specifically iterators would
 be high on my list.

 - Patch

 There is toUpper/toLower, but one for toTitle is missing.

 - In the code's README:

 Note: UString is interchangable with zend strings for method parameters
 and can be cast for output/conversion to zend strings

 How does that work? And what would it convert to?

 - How are characters counted?

 Is a character a Code Point, or is a character a base character +
 combining diacritics? In the first form, A + ° is considered as two
 characters; in the second, just one. For wordwrap, splice,
 substring, it is really important that only the *full sequence* is
 considered as a character. And hence, a character really should be the
 full sequence. The text in charAt seems to contradict that, and that
 is a mistake.

 In the original PHP 6 we didn't do that for performance reasons, but
 that point is moot now as only people who opt into using Text will
 suffer from this.

 - trim

 What is a leading or trailing space? Is it just U+0020, or other
 Unicode-defined space characters as well? (&nbsp;, U+00A0, comes to mind here)

 - What is UG(defaultpad) about?

 - For the code:

   - there is some interesting, non-standard whitespacing going on:

 - { goes on the next line after a func decl
 - sometimes 4 spaces instead of a tab are used for indentation

 - Why is there no __toString()?

 - How can other extensions, not really making use of Text, use their
   strings (as UTF-8 strings, for example)?


 cheers,
 Derick


 On Sat, 28 Feb 2015, Joe Watkins wrote:

  Morning internals,
 
  This is just a quick note to announce my intention to ready this RFC
  for voting next week.
 
  I know I'm a little late maybe, I was real sick most of last week,
so
  couldn't do anything useful.
 
  A couple of us intend to fix outstanding issues on github and those
  raised here, tidy the RFC and open the vote for 7.
 
 I would ask anyone interested to scan through this thread and
announce
  concerns that are not mentioned asap.
 
  Cheers
  Joe
 
  On Fri, Oct 24, 2014 at 3:01 PM, Chris Wright daveran...@php.net
wrote:
 
   On 24 October 2014 07:03, Joe Watkins pthre...@pthreads.org wrote:
  
   On Thu, 2014-10-23 at 12:54 -0700, Stas Malyshev wrote:
Hi!
   
 P.S. u() is a bad name, will break lots of code, i.e.
   
Maybe __u()? It's a bit ugly but you're not allowed to use __ so
it's
   safe.
   
  
   /me cringes ...
  
   I wonder how much of a problem it really is, usually when we say some
   function name is a problem is because of hundreds and hundreds of
   results on github.
  
   If it's a huge problem then we should rename it, if we have to dig
   around for a single project that's incompatible, or even a handful,
then
   it's not really a problem.
  
   Cheers
   Joe
  
  
   I can see this being something relatively common. While I personally
would
   never do it, there are a few reasons I can think of that people
*might* do
   it:
  
   - Wrapper for creating <u> HTML output
   - urlencode() shortcut
   - (obviously) various unicode-related things
  
   Searching on codesearch [1] revealed (amongst a few other hits on the
   first page) another interesting use of it in the hhvm test suite [2].
It's
   difficult to search for this because all the available public search
   engines that I know of do fuzzy matching.
  
   Sorry. This sucks, because every other option we have for this is

Re: [PHP-DEV] [RFC] UString

2015-03-01 Thread Yasuo Ohgaki
Hi Joe and Derick,

On Mon, Mar 2, 2015 at 5:25 AM, Derick Rethans der...@php.net wrote:

 I think there are a few issues with the proposal, although I like the
 general idea. I've had the tab with the RFC open since October... but
 never looked at it until now :-/. So, a few comments:

 - UString as a name.

 I think I am going to prefer Text as a class name. Unicode (and
 intl/icu) have lots of operators acting on items containing unicode
 strings. But they are really pieces of text. For example sentences, word
 break iterators, etc. UString *feels* clunky, and not standard. If
 it's going to be part of PHP core, then we should pick a core name. (I
 might prefer String, but that's going to cause a whole lot of issues
 obviously).


I think it's better to store string/text data in one fixed encoding/codepage.
Although conversion between Unicode encodings is cheap (cheap compared
to conversion to other encodings such as SJIS, EUC, ISO-2022, etc.), UTF-8
is the better choice because

 - PCRE only supports UTF-8
 - SQLite only supports UTF-8
 - PHP uses UTF-8 as the default now
 - Recent web apps use UTF-8 as their encoding
 - Single encoding for stored text/string is simpler
 - Considering normalization, having UTF-8 with NFC is less confusing.

However, I don't mind too much if any encoding can be stored in a Text/
UString object. IIRC, Ruby does this without much of a problem.

If we support multiple encodings, we should resolve:

$new = $str_utf8 . $str_sjis; // $new is UTF-8 or SJIS? Raise error?
$new = $str_nfc . $str_nfd; // $new is NFC or NFD, mixed? Raise error?
$new = $str_utf16le . $str_utf16be; // $new is ?? How BOM is handled?
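A minimal sketch of why the questions above matter, using plain byte strings and ext/mbstring (assumed to be loaded; the variable names are illustrative only). Concatenating bytes from two encodings produces a string that no longer validates as UTF-8, whereas converting first keeps it well-formed:

```php
<?php
// Assumes ext/mbstring. Variable names are illustrative only.
$utf8 = "\u{65E5}\u{672C}";                           // two kanji as UTF-8 bytes
$sjis = mb_convert_encoding($utf8, 'SJIS', 'UTF-8');  // same text as Shift_JIS bytes

// Plain concatenation just joins bytes; the result is not valid UTF-8.
$mixed = $utf8 . $sjis;
$mixedIsUtf8 = mb_check_encoding($mixed, 'UTF-8');    // false

// Converting to one encoding before joining keeps the result well-formed.
$joined = $utf8 . mb_convert_encoding($sjis, 'UTF-8', 'SJIS');
$joinedIsUtf8 = mb_check_encoding($joined, 'UTF-8');  // true

var_dump($mixedIsUtf8, $joinedIsUtf8);
```

Whether a Text/UString object should convert implicitly or raise an error in this situation is exactly the open design question.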


 - Needs More Methods

 I had a look at the API that it links to, and I miss operators like
 iterators. Over words, sentences, characters, etc. Basically the
 functionality of
 http://docs.php.net/manual/en/class.intlbreakiterator.php,
 http://docs.php.net/manual/en/class.intlrulebasedbreakiterator.php and
 http://docs.php.net/manual/en/class.intlcodepointbreakiterator.php

 I realize intl already implements this, but it's really beneficial to
 have for a Text class - especially for replacing functionality where
 people now look over a string - with a character index.


There are missing features... We may implement most of them before
release.
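For reference, a short sketch of what the existing intl iterators already provide, assuming ext/intl is loaded. It walks the same text by user-perceived character and by word using IntlBreakIterator; the sample text and variable names are only illustrative:

```php
<?php
// Assumes ext/intl. IntlBreakIterator is the existing intl class linked above.
$text = "Cafe\u{0301} au lait";   // "Café" written with a combining acute accent

$chars = IntlBreakIterator::createCharacterInstance('en');
$chars->setText($text);
$graphemes = iterator_to_array($chars->getPartsIterator(), false);

$words = IntlBreakIterator::createWordInstance('en');
$words->setText($text);
// Word iteration also yields the whitespace between words; drop it here.
$wordList = array_values(array_filter(
    iterator_to_array($words->getPartsIterator(), false),
    fn ($part) => trim($part) !== ''
));

echo count($graphemes), "\n";   // 12, although the text has 13 code points
print_r($wordList);
```

A Text class could expose the same iteration as methods instead of requiring a separate iterator object.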


 - Not a full String API Replacement

 I would certainly expect more from it than just the UnicodeString API.
 Perhaps not for a first iteration, but certainly for subsequent
 versions. Things like transliterations, and specifically iterators would
 be high on my list.


Sounds good.



 - Patch

 toUpper/toLower, there is a missing one for toTitle

 - In the code's README:

 Note: UString is interchangeable with zend strings for method parameters
 and can be cast for output/conversion to zend strings

 How does that work? And what would it convert to?


I guess Joe means it's using zend_string internally?



 - How are characters counted?

 Is a character a Code Point, or is a character a base character +
 combining diacritics. In the first form, A + ° is considered as two
 characters, in the second option, just one. For wordwrap, splice,
 substring, it is really important that only the *full sequence* is
 considered as a character. And hence, a character really should be the
 full sequence. The text in charAt seems to contradict that, and that
 is a mistake.


One reason I prefer NFC.
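A small sketch of the difference, assuming ext/intl and ext/mbstring. It uses "A" followed by U+030A (combining ring above) as the A + ° case, and compares code point and grapheme counts before and after NFC normalisation:

```php
<?php
// Assumes ext/intl and ext/mbstring.
$decomposed = "A\u{030A}";   // "A" followed by COMBINING RING ABOVE
$composed   = Normalizer::normalize($decomposed, Normalizer::FORM_C); // U+00C5

$counts = [
    'decomposed code points' => mb_strlen($decomposed, 'UTF-8'), // 2
    'decomposed graphemes'   => grapheme_strlen($decomposed),    // 1
    'composed code points'   => mb_strlen($composed, 'UTF-8'),   // 1
    'composed graphemes'     => grapheme_strlen($composed),      // 1
];
print_r($counts);
```

NFC only closes the gap where a precomposed code point exists; sequences without one still need grapheme-level handling.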



 In the original PHP 6 we didn't do that due to performance reasons, but
 that point is moot now as only people who opt into using Text will
 suffer from this.

 - trim

 What is a leading or trailing space? Is it just U+0020, or other Unicode
 defined space characters as well? (&nbsp;, U+00A0 comes to mind here)


Any space is better to be trimmed.
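To illustrate with today's functions: the byte-oriented trim() leaves Unicode spaces alone, while a PCRE pattern with the /u modifier can strip any separator. This is only a sketch of the behaviour being discussed, not the extension's implementation:

```php
<?php
// Assumes PCRE with Unicode support (the /u modifier).
$padded = "\u{00A0} text\u{3000}";  // NO-BREAK SPACE, space, "text", IDEOGRAPHIC SPACE

// trim() only strips ASCII whitespace bytes from the two ends.
$byteTrim = trim($padded);

// \p{Z} matches every Unicode separator, \s adds tabs and newlines.
$unicodeTrim = preg_replace('/^[\p{Z}\s]+|[\p{Z}\s]+$/u', '', $padded);

var_dump($byteTrim === $padded);   // true: neither end starts with ASCII whitespace
var_dump($unicodeTrim);            // string(4) "text"
```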



 - What is UG(defaultpad), about?

 - For the code:

   - there is some interesting, non-standard whitespacing going on:

 - { goes on next line after a func decl
 - sometimes 4 spaces instead of a tab are used for indentation,

 - Why is there no __toString() ?


If this is missing, there should be __toString()



 - How can other extensions, not really making use of Text, use their
   strings (as UTF8 strings f.e.)


I agree that Internal API needs improvement.

Overall, I think it's a good start if the basic issue is resolved.
The most important question is whether it supports a single encoding or
multiple encodings for the stored text/string.
There are many things programmers need to know if multiple encodings are
supported, but I don't object strongly to multiple encoding support. It's
nice to have the ability to handle SJIS, ISO-2022, etc. natively.

Regards,

--
Yasuo Ohgaki
yohg...@ohgaki.net


Re: [PHP-DEV] [RFC] UString

2015-02-28 Thread Rowan Collins

On 28/02/2015 06:48, Joe Watkins wrote:

Morning internals,

 This is just a quick note to announce my intention to ready this RFC
for voting next week.

 I know I'm a little late maybe, I was real sick most of last week, so
couldn't do anything useful.

 A couple of us intend to fix outstanding issues on github and those
raised here, tidy the RFC and open the vote for 7.

I would ask anyone interested to scan through this thread and announce
concerns that are not mentioned asap.


I still think this class is trying to do several jobs, and not doing any 
of them very well, and I fear that people will see this class and expect 
it to solve problems which it actually ignores.


Here are some concrete use cases I would like a simple interface to 
solve for me:


- Take text from an ISO 8859-2 data source, pass it through generic 
text filters, and pass it to a UTF-16 data target.
- Given a long string of Unicode text, give me a valid UTF-8 string 
which fits into a buffer with fixed byte size; i.e. give me the largest 
number of whole code points which fit into that number of bytes once 
encoded.
- As above, but without stripping diacritics off the last character of 
the resulting string, i.e. give me the largest number of whole graphemes 
which fit.
- Split a string into equal sized chunks of readable characters 
(graphemes), regardless of how many bytes or code points each chunk 
contains.


UString currently falls short of all of these:

- I can specify my input encoding (in the constructor or helper method, 
over-riding a static default, which is equivalent to ext/mbstring's 
global setting), but not my output encoding (there is no method to ask 
for a byte representation other than a string cast, which by definition 
has no parameters).
- I can ask for a fixed number of code points, but don't know how many 
bytes these will take until I cast to a UTF-8 string.
- I can't manipulate anything at the grapheme level at all, even though 
this is the most meaningful level of operation in most cases.
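For comparison, a sketch of how the byte-budget cases can be approximated today with ext/mbstring and ext/intl (both assumed to be loaded). The sample text and budget are illustrative:

```php
<?php
// Assumes ext/mbstring and ext/intl.
$text   = "abe\u{0301}";   // "ab" + "e" + COMBINING ACUTE ACCENT: 5 bytes in UTF-8
$budget = 3;               // buffer size in bytes

// Largest run of whole code points that fits: keeps "e" but drops its accent.
$byCodePoint = mb_strcut($text, 0, $budget, 'UTF-8');

// Largest run of whole graphemes that fits: drops the accented "e" entirely.
$byGrapheme = grapheme_extract($text, $budget, GRAPHEME_EXTR_MAXBYTES);

var_dump($byCodePoint, $byGrapheme);   // "abe" and "ab"
```

Both results are valid UTF-8, but only the second avoids stripping the diacritic from the last visible character.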


Things it does do:

- a handful of methods give meaningful international text support: 
toUpper(), toLower(),  trim()
- some methods could be done on byte strings if I ensure they're all in 
UTF-8: replace(), contains(), startsWith(), endsWith(), repeat()
- there may be limited situations where I want to dive into the code 
points which make up a string, although I can't think of many: $length, 
pad(), indexOf(), lastIndexOf(), charAt(), replaceSlice()
- remaining methods avoid me creating invalid UTF-8, but don't help me 
much with real-life text: chunk(), split(), substring()
- I can ask what codepage my Unicode string is in; I don't even 
understand what this means


I think an efficient OO wrapper around ICU is a great idea, but more 
thought needs to go into what methods are exposed, and how people are 
going to use them in real code.


Regards,
--
Rowan Collins
[IMSoP]

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] [RFC] UString

2015-02-27 Thread Joe Watkins
Morning internals,

This is just a quick note to announce my intention to ready this RFC
for voting next week.

I know I'm a little late maybe, I was real sick most of last week, so
couldn't do anything useful.

A couple of us intend to fix outstanding issues on github and those
raised here, tidy the RFC and open the vote for 7.

   I would ask anyone interested to scan through this thread and announce
concerns that are not mentioned asap.

Cheers
Joe

On Fri, Oct 24, 2014 at 3:01 PM, Chris Wright daveran...@php.net wrote:

 On 24 October 2014 07:03, Joe Watkins pthre...@pthreads.org wrote:

 On Thu, 2014-10-23 at 12:54 -0700, Stas Malyshev wrote:
  Hi!
 
   P.S. u() is a bad name, will break lots of code, i.e.
 
  Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's
 safe.
 

 /me cringes ...

 I wonder how much of a problem it really is, usually when we say some
 function name is a problem is because of hundreds and hundreds of
 results on github.

 If it's a huge problem then we should rename it, if we have to dig
 around for a single project that's incompatible, or even a handful, then
 it's not really a problem.

 Cheers
 Joe


 I can see this being something relatively common. While I personally would
 never do it, there are a few reasons I can think of that people *might* do
 it:

 - Wrapper for creating <u> HTML output
 - urlencode() shortcut
 - (obviously) various unicode-related things

 Searching on codesearch [1] revealed (amongst a few other hits on the
 first page) another interesting use of it in the hhvm test suite [2]. It's
 difficult to search for this because all the available public search
 engines that I know of do fuzzy matching.

 Sorry. This sucks, because every other option we have for this sucks.

 On the bright side, anything chosen could always be aliased at the top of
 the file:

 use function __u as u;

 This also sucks, but it sucks a little bit less because the collisions are
 avoided - or at least, avoided in such a way that the onus is on the user -
 and one can still have the sane name.

 First-class support at the syntax level (presumably $foo = u"unicode
 string" since we already have $foo = b"binary string") would IMO be better
 and (hopefully?) a long-term goal, but I am aware that it is - and probably
 should be - outside the scope of the current proposal.

 [1] https://searchcode.com/?q=function+u+lang%3Aphp
 [2]
 https://github.com/facebook/hhvm/blob/master/hphp/test/slow/ext_icu/uspoof.php#L13



Re: [PHP-DEV] [RFC] UString

2014-10-24 Thread Joe Watkins
On Thu, 2014-10-23 at 12:54 -0700, Stas Malyshev wrote:
 Hi!
 
  P.S. u() is a bad name, will break lots of code, i.e.
 
 Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe.
 

/me cringes ...

I wonder how much of a problem it really is, usually when we say some
function name is a problem is because of hundreds and hundreds of
results on github.

If it's a huge problem then we should rename it, if we have to dig
around for a single project that's incompatible, or even a handful, then
it's not really a problem.

Cheers
Joe





Re: [PHP-DEV] [RFC] UString

2014-10-24 Thread Chris Wright
On 24 October 2014 07:03, Joe Watkins pthre...@pthreads.org wrote:

 On Thu, 2014-10-23 at 12:54 -0700, Stas Malyshev wrote:
  Hi!
 
   P.S. u() is a bad name, will break lots of code, i.e.
 
  Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's
 safe.
 

 /me cringes ...

 I wonder how much of a problem it really is, usually when we say some
 function name is a problem is because of hundreds and hundreds of
 results on github.

 If it's a huge problem then we should rename it, if we have to dig
 around for a single project that's incompatible, or even a handful, then
 it's not really a problem.

 Cheers
 Joe


I can see this being something relatively common. While I personally would
never do it, there are a few reasons I can think of that people *might* do
it:

- Wrapper for creating <u> HTML output
- urlencode() shortcut
- (obviously) various unicode-related things

Searching on codesearch [1] revealed (amongst a few other hits on the first
page) another interesting use of it in the hhvm test suite [2]. It's
difficult to search for this because all the available public search
engines that I know of do fuzzy matching.

Sorry. This sucks, because every other option we have for this sucks.

On the bright side, anything chosen could always be aliased at the top of
the file:

use function __u as u;

This also sucks, but it sucks a little bit less because the collisions are
avoided - or at least, avoided in such a way that the onus is on the user -
and one can still have the sane name.
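A sketch of that aliasing, with a hypothetical userland __u() standing in for the proposed built-in helper (the function body here is made up purely so the example runs):

```php
<?php
// Import the longer global name under the short alias for this file only.
use function __u as u;

// Hypothetical stand-in for the proposed built-in; not the RFC's implementation.
function __u(string $bytes): string
{
    return 'unicode:' . $bytes;
}

$result = u('text');   // resolves to __u() through the alias
echo $result, "\n";    // unicode:text
```

Files that do not add the use line never see a function called u(), so existing userland definitions of u() elsewhere are left alone.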

First-class support at the syntax level (presumably $foo = u"unicode
string" since we already have $foo = b"binary string") would IMO be better
and (hopefully?) a long-term goal, but I am aware that it is - and probably
should be - outside the scope of the current proposal.

[1] https://searchcode.com/?q=function+u+lang%3Aphp
[2]
https://github.com/facebook/hhvm/blob/master/hphp/test/slow/ext_icu/uspoof.php#L13


Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Joe Watkins
On Tue, 2014-10-21 at 10:30 -0700, Stas Malyshev wrote:
 Hi!
 
  I wish there was a way for specific objects to opt into this.
 
 There will be, if __hashKey() or whatever would be the properly
 bikeshedded name, becomes reality as discussed elsewhere. It shouldn't
 be hard to do and it's exactly what many other languages do when trying
 to use objects as keys for maps.
 
 

Not ready for discussion yet ...

https://wiki.php.net/rfc/hashkey

But it exists, I think it solves a problem for ustring in particular but
it solves the problem in general too. No time to write about it or
discuss it at this moment, but in pipeline, hopefully ...

Cheers
Joe





Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Joe Watkins
On Tue, 2014-10-21 at 21:42 +0100, Rowan Collins wrote:
 On 21/10/2014 08:06, Joe Watkins wrote:
  Morning internalz,
 
  https://wiki.php.net/rfc/ustring
 
  This is the result of work done by a few of us, we won't be opening any
  vote in a fortnight. We have a long time before 7, there is no rush
  whatever.
 
  Now seems like a good time to start the conversation so we can hash out
  the details, or get on with other things ;)
 
  Cheers
  Joe
 
 
 
 I think this looks like a really great start at creating something 
 actually useful, rather than getting stuck at the drawing board. I like 
 that the scope is quite small initially - where does the single 
 responsibility of a class that represents a string end, anyway? :)
 
 A few opinions:
 
 1) Global / static defaults are bad.
 
 The existence of the setDefaultCodepage method feels like an 
 anti-pattern to me. It means libraries can't rely on this class working 
 the same way in two different host environments, or even at two 
 re-entries in the same program. Effectively, if you don't know what the 
 second argument to the constructor will default to, you can't actually 
 treat it as optional unless you're writing monolithic code. This is a 
 common pattern in PHP, but http_build_query() would be so much more 
 pleasant if I could safely call it with 1 argument instead of 3.
 
 I think the default should be hard-coded to UTF-8, which according to 
 previous discussion is always the default *output* encoding, so would 
 mean this would always work: $aUString = new UString( (string)$aUString 
 ); Any other encoding will be dependent on, and known from, the context 
 where the object is created - if grabbing data from an HTTP request, a 
 header should tell them; if from a database, a connection parameter; and 
 so on.
 

Could be true, it feels quite horrible to me today too, I think someone
else suggested it, but it might have been me.

I'll look at doing something about that ...
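The same hazard can be seen with ext/mbstring's process-wide default (assumed to be loaded). This sketch only illustrates the global-default problem; it is not UString code:

```php
<?php
// Assumes ext/mbstring; mb_internal_encoding() is its process-wide default.
$text = "\u{00E9}t\u{00E9}";   // "été" as UTF-8 bytes (5 bytes, 3 code points)

mb_internal_encoding('UTF-8');
$hostA = mb_strlen($text);             // 3

mb_internal_encoding('ISO-8859-1');    // some other code changed the default
$hostB = mb_strlen($text);             // 5: every byte now counts as a character

$explicit = mb_strlen($text, 'UTF-8'); // 3, whatever the default is

var_dump($hostA, $hostB, $explicit);
```

A library that relies on the optional argument gets different answers in the two hosts; only the explicit call is portable.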

 The only case I can see where a default encoding would be sensible would 
 be where source code itself is in a different encoding, so that 
 u('literal string') works as expected. I guess if we ever went down the 
 route of special literal syntax like u'literal string', the declared 
 source encoding could be used.
 
 Actually, the u() shortcut function appears to be missing the encoding 
 parameter completely; is this deliberate?
 

Fixed that.

 2) Clarify relationship to a byte string
 
 Most of the API acts like this is an abstract object representing a 
 bunch of Unicode code points. As such, I'm not sure what getCodepage() 
 does - a code page (or more properly encoding) is a property of a stream 
 of bytes, so has no meaning in this context, surely? The internal 
 implementation could use UTF-8, UTF-16, or some made-up encoding (like 
 Perl6's NFG system) and the user should never need to know (other than 
 to understand performance implications).
 
 On the other hand, when you *do* want a stream of bytes, the class 
 doesn't seem to have an explicit way to get one. The (currently 
 undocumented) behaviour is apparently to spit out UTF-8 if cast to a 
 string, but it would be nice to have an explicit function which could be 
 passed a parameter in order to serialise to, say, UTF-16, instead.
 

I reused the terminology used by ICU, it made sense in their
documentation. 

So we want a ::getBytes or something like that ... I'll do that ...
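A rough userland sketch of what such a method could look like, built on ext/mbstring. The class name Utf8Text and the getBytes() signature are assumptions for illustration, not the extension's API:

```php
<?php
// Hypothetical sketch: class and method names are assumptions, not the RFC's API.
// Assumes ext/mbstring for the conversions.
final class Utf8Text
{
    private string $utf8;

    public function __construct(string $bytes, string $encoding = 'UTF-8')
    {
        // Decode the input once; everything is held as UTF-8 internally.
        $this->utf8 = mb_convert_encoding($bytes, 'UTF-8', $encoding);
    }

    // Explicit serialisation to a caller-chosen encoding.
    public function getBytes(string $encoding = 'UTF-8'): string
    {
        return mb_convert_encoding($this->utf8, $encoding, 'UTF-8');
    }

    public function __toString(): string
    {
        return $this->utf8;
    }
}

$text = new Utf8Text("\xE9", 'ISO-8859-1');        // "é" as a single Latin-1 byte
echo bin2hex((string) $text), "\n";                // c3a9
echo bin2hex($text->getBytes('UTF-16BE')), "\n";   // 00e9
```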

 3) The Grapheme Question
 
 This has been raised a few times, so I won't labour the point, just 
 mention my current thinking.
 
 Unicode is complicated. Partly, that's because of a series of 
 compromises in its design; but partly, it's because writing systems are 
 complicated, and Unicode tries harder than most previous systems to 
 acknowledge that. So, there's a tradeoff to be made between giving users 
 what they think they need, thus hiding the messy details, and giving 
 users the power to do things right, in a more complex way.
 
 There is also a namespace mess if you insist on every function and 
 property having to declare what level of abstraction it's talking about 
 - e.g. $codePointLength instead of $length.
 
 An idea I've been toying with is rather than having one class 
 representing the slippery notion of a Unicode string, having (at 
 least) two, closely tied, classes: CodePointString (roughly = UString 
 right now) and GraphemeString (a higher level abstraction tied to the 
 same internal representation).
 
 I intend to mock this up as a set of interfaces at some point, but the 
 basic idea is that you could write this:
 
 // Get an abstract object from a byte string, probably a GraphemeString,
 // parsing the input as UTF-8
 $str = u('some text');
 // Perform an operation that explicitly deals in Code Points
 $str = $str->asCodePoints()->normalise('NFC');
 // Get information using a higher level of abstraction
 $length = $str->asGraphemes()->length;
 // Perform a high-level mutation, then convert right 

Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Joe Watkins
On Tue, 2014-10-21 at 10:28 -0700, Stas Malyshev wrote:
 Hi!
 
  https://wiki.php.net/rfc/ustring
  
  This is the result of work done by a few of us, we won't be opening any
  vote in a fortnight. We have a long time before 7, there is no rush
  whatever.
 
 Couple of thoughts:
 - I like the idea of having a unicode string class. May be a way to
 figure out the right way to do it without messing up the whole core.
 
 - I wish there were more description of which API this class provides.
 If it's planned to be direct copy of UnicodeString, some of the
 operations there are not how PHP strings usually work (i.e. in-place
 modification) and it's not really enough to make it useful - e.g. what
 if I need to do regexps on it, for example? Or does it cover whole
 mbstring API too? What about something mbstring doesn't cover, like
 ucfirst or strrev?

API on github in readme.

Regexp not covered yet, ICU has a nicer Matcher/Pattern API like Java's,
I'm not sure what to do there, an ICU based API could certainly be
introduced.
 
 - Do we really need different encodings, different backends and so on,
 internally? Note that each backend has its own quirks, limitations and
 bugs, and there's nothing worse than dealing with unpredictable set of
 dependencies. The user cares what they send into the class and what
 comes out, but very rarely they care what happens inside - why not just
 do it one way everywhere?
 

No, actually, I don't think we do. It was over complicating something
simple, so I removed the backend abstraction and will work towards
solving the rest too.

We'll use ICU, because it's battle tested like nothing else and keeps
everything simple ... it doesn't make sense to introduce a possibly
unstable and, as you rightly say, different API with its own quirks.

Cheers
Joe





Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Joe Watkins
On Tue, 2014-10-21 at 07:49 -0700, Sara Golemon wrote:
  On Oct 21, 2014, at 0:06, Joe Watkins pthre...@pthreads.org wrote:
  
  Morning internalz,
  
 https://wiki.php.net/rfc/ustring
  
 This is the result of work done by a few of us, we won't be opening any
  vote in a fortnight. We have a long time before 7, there is no rush
  whatever.
  
 
 The backend abstraction seems overengineered to me.  It could also lead to 
 inconsistencies in behavior if ICU and Windows implement something in subtly 
 different ways.
 
 Since we're linking ICU for the rest of the intl extension anyway, it seems 
 to me like we should just focus on it as an ICU wrapper.
 
 Also, I'd propose a minor amendment to this RFC that other intl classes be 
 extended to support taking UString instances as arguments (avoiding the 
 implicit conversion to UTF8). That work doesn't have to gate adoption of the 
 base implementation, it'd just be useful to decide at the same time if we 
 want to do so.
 
 -Sara

Actually I agree, I just needed a few people to say WTF.

Backend gone, we are gonna use ICU, rfc/ext updated.

INTL is still an open question yeah, preference noted.

Cheers
Joe





Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Dmitry Stogov
this won't completely solve the problem, because array keys won't be
UString anymore.

Thanks. Dmitry.

On Thu, Oct 23, 2014 at 12:11 PM, Joe Watkins pthre...@pthreads.org wrote:

 On Tue, 2014-10-21 at 10:30 -0700, Stas Malyshev wrote:
  Hi!
 
   I wish there was a way for specific objects to opt into this.
 
  There will be, if __hashKey() or whatever would be the properly
  bikeshedded name, becomes reality as discussed elsewhere. It shouldn't
  be hard to do and it's exactly what many other languages do when trying
  to use objects as keys for maps.
 
 

 Not ready for discussion yet ...

 https://wiki.php.net/rfc/hashkey

 But it exists, I think it solves a problem for ustring in particular but
 it solves the problem in general too. No time to write about it or
 discuss it at this moment, but in pipeline, hopefully ...

 Cheers
 Joe




Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Rowan Collins

Joe Watkins wrote on 23/10/2014 09:18:

I'd rather higher level stuff existed at a higher level, I'd rather
solve for ustring the problems that are solved for normal strings and
leave the rest up to whatever the framework/component/library wants
to do.


It's not really higher level in terms of the problem being solved, it's 
the same functions applied to a higher abstraction of what "string" 
means. It doesn't make much sense to say that u($foo)->length solves 
the same problem as strlen($foo), but grapheme_strlen($foo) is somehow 
"higher level". They're three different definitions of the word "length" 
which can be applied to the same string, and it would be nice if they 
were all accessible through the same API.


I get the feeling people are thinking of grapheme functions as something 
exotic and hard to implement, but ext/intl seems to have a very 
straight-forward set of functions for them: 
http://php.net/manual/en/ref.intl.grapheme.php


The two-interfaces idea was just to get over the naming problem of 
prefixing everything with codePointX or graphemeX, and wouldn't actually 
require a separate data structure under the hood.
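A rough userland sketch of that idea, assuming ext/mbstring and ext/intl. The names Text, CodePointView and GraphemeView are made up for illustration; both views wrap the same UTF-8 string and differ only in how they count:

```php
<?php
// Hypothetical sketch: Text, CodePointView and GraphemeView are assumed names,
// not part of the RFC. Assumes ext/mbstring and ext/intl.
final class CodePointView
{
    private string $utf8;
    public function __construct(string $utf8) { $this->utf8 = $utf8; }
    public function length(): int { return mb_strlen($this->utf8, 'UTF-8'); }
}

final class GraphemeView
{
    private string $utf8;
    public function __construct(string $utf8) { $this->utf8 = $utf8; }
    public function length(): int { return (int) grapheme_strlen($this->utf8); }
}

final class Text
{
    private string $utf8;   // one shared internal representation
    public function __construct(string $utf8) { $this->utf8 = $utf8; }
    public function byteLength(): int { return strlen($this->utf8); }
    public function asCodePoints(): CodePointView { return new CodePointView($this->utf8); }
    public function asGraphemes(): GraphemeView { return new GraphemeView($this->utf8); }
}

$text = new Text("e\u{0301}x");   // "e" + combining acute accent, then "x"
$lengths = [
    $text->byteLength(),               // 4
    $text->asCodePoints()->length(),   // 3
    $text->asGraphemes()->length(),    // 2
];
print_r($lengths);
```

The method name stays length() in both views, which is the naming benefit described above.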

--
Rowan Collins
[IMSoP]




Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Joe Watkins
On Thu, 2014-10-23 at 12:44 +0400, Dmitry Stogov wrote:
 this won't completely solve the problem, because array keys won't be
 UString anymore.

http://docs.oracle.com/javase/7/docs/api/java/lang/Object.html#hashCode()

Others solve this problem in exactly this way, the Java implementation
requires that you return an int.

The one in that draft will allow you to return any scalar. This is much
more suitable for PHP.

It doesn't solve the problem directly but allows the programmer to solve
it for themselves, just like Object.hashCode in Java.
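As a userland approximation of the idea (not the draft's engine-level behaviour), a map can ask each key object for a scalar and index on that. Hashable, hashKey(), Tag and KeyedMap are all hypothetical names, and the return type is narrowed to string for simplicity:

```php
<?php
// Hypothetical sketch; Hashable, hashKey(), Tag and KeyedMap are assumed names.
interface Hashable
{
    public function hashKey(): string;
}

final class Tag implements Hashable
{
    private string $name;
    public function __construct(string $name) { $this->name = $name; }
    public function hashKey(): string { return $this->name; }
}

final class KeyedMap
{
    /** @var array<string, array{0: Hashable, 1: mixed}> */
    private array $entries = [];

    public function set(Hashable $key, $value): void
    {
        // Index by the scalar, but keep the key object alongside the value.
        $this->entries[$key->hashKey()] = [$key, $value];
    }

    public function get(Hashable $key)
    {
        return $this->entries[$key->hashKey()][1] ?? null;
    }
}

$map = new KeyedMap();
$map->set(new Tag('php'), 1);
$found = $map->get(new Tag('php'));   // a different object with an equal key
var_dump($found);                     // int(1)
```

Lookup goes by the returned scalar, so two distinct objects with equal keys address the same entry, as with Object.hashCode-based maps in Java.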
 
 Thanks. Dmitry.
 
 
 On Thu, Oct 23, 2014 at 12:11 PM, Joe Watkins pthre...@pthreads.org
 wrote:
 On Tue, 2014-10-21 at 10:30 -0700, Stas Malyshev wrote:
  Hi!
 
   I wish there was a way for specific objects to opt into
 this.
 
  There will be, if __hashKey() or whatever would be the
 properly
  bikeshedded name, becomes reality as discussed elsewhere. It
 shouldn't
  be hard to do and it's exactly what many other languages do
 when trying
  to use objects as keys for maps.
 
 
 
 Not ready for discussion yet ...
 
 https://wiki.php.net/rfc/hashkey
 
 But it exists, I think it solves a problem for ustring in
 particular but
 it solves the problem in general too. No time to write about
 it or
 discuss it at this moment, but in pipeline, hopefully ...
 
 Cheers
 Joe
 
Cheers
Joe






Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Andrea Faulds

 On 23 Oct 2014, at 09:44, Dmitry Stogov dmi...@zend.com wrote:
 
 this won't completely solve the problem, because array keys won't be
 UString anymore.

Sure, but unless we turn arrays into SplObjectStorage that won’t change. Nobody 
wants to touch arrays and make them support other key types. Heck, my bigint 
RFC doesn’t even do that.

--
Andrea Faulds
http://ajf.me/








Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Rowan Collins

Dmitry Stogov wrote on 21/10/2014 10:01:

The right approach, would be extending zend_string with encoding and
then adopting near all functions working with zend_string to take
encoding into account. But, of course, this is going to lead to much more
complicated solution (with some slowdown).


Isn't that kind of what ext/mbstring does?

I think that treating Unicode as nothing more than an encoding, and 
trying to hide all its complexity from the user, is not particularly 
wise. Unicode isn't just "ASCII, but bigger", so keeping the same API 
but making the implementation work with more characters isn't really 
"Unicode support".


For instance, what does allowing Unicode strings as array keys 
actually mean? We already allow pretty much any sequence of bytes as an 
array key, so what we're actually talking about is that array-handling 
functions should be somehow Unicode aware. In the case of sorting 
functions, that means a mechanism for selecting a collation, even if you 
know how the strings are encoded.
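A short sketch of the collation point, assuming ext/intl with German and Swedish locale data available. The same three keys sort differently by bytes, under German rules and under Swedish rules:

```php
<?php
// Assumes ext/intl with locale data for German and Swedish.
$stock = ['zebra' => 1, "\u{00E4}pple" => 2, 'apple' => 3];

$bytes = $stock;
ksort($bytes, SORT_STRING);                              // raw byte order

$german = $stock;
uksort($german, [new Collator('de_DE'), 'compare']);     // "ä" sorts next to "a"

$swedish = $stock;
uksort($swedish, [new Collator('sv_SE'), 'compare']);    // "ä" sorts after "z"

print_r(array_keys($bytes));    // apple, zebra, äpple
print_r(array_keys($german));   // apple, äpple, zebra
print_r(array_keys($swedish));  // apple, zebra, äpple
```

Knowing that every key is UTF-8 is therefore not enough to pick an order; the caller still has to choose a collation.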


There are a handful of operations which have an obvious meaning under 
Unicode - strtoupper(), for instance. It might be nice if those worked 
transparently with UStrings, but I don't think that really constitutes 
complete Unicode support either.


I think we're going to keep going round in circles unless we can really 
pin down what it means for a language to support Unicode.

--
Rowan Collins
[IMSoP]




Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Andrea Faulds

 On 23 Oct 2014, at 14:44, Rowan Collins rowan.coll...@gmail.com wrote:
 
 Dmitry Stogov wrote on 21/10/2014 10:01:
 The right approach, would be extending zend_string with encoding and
 then adopting near all functions working with zend_string to take
 encoding into account. But, of course, this is going to lead to much more
 complicated solution (with some slowdown).
 
 Isn't that kind of what ext/mbstring does?
 
 I think that treating Unicode as nothing more than an encoding, and trying to 
 hide all its complexity from the user, is not particularly wise. Unicode 
 isn't just "ASCII, but bigger", so keeping the same API but making the 
 implementation work with more characters isn't really "Unicode support".

I’m inclined to agree here. Having an encoding-aware zend_string vs. having a 
Unicode-aware string aren’t quite the same. Certain string operations are only 
possible for certain encodings, and by supporting any encoding we risk making 
things confusing. I’d rather we convert everything to Unicode.
--
Andrea Faulds
http://ajf.me/








Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Johannes Schlüter
On Thu, 2014-10-23 at 11:38 +0100, Joe Watkins wrote:
 It doesn't solve the problem directly but allows the programmer to solve
 it for themselves, just like Object.hashCode in Java.

The point is that it won't work in this way:

   $a = [ $ustring => $value ];
   foreach ($a as $key => $v) {
       $key->ustring_method();
   }

but one needs something along the lines of

   $a = [ $ustring => $value ];
   foreach ($a as $key => $v) {
       UString::fromHashCode($key)->ustring_method();
   }

which likely loses object identity.

It works but is not really nice :-)
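The effect can be reproduced today with any string-like object. In this sketch Label is a hypothetical stand-in for UString; the key comes back as a plain string and has to be re-wrapped, whereas SplObjectStorage keeps the original object:

```php
<?php
// Label is a hypothetical stand-in for a string-like object such as UString.
final class Label
{
    private string $text;
    public function __construct(string $text) { $this->text = $text; }
    public function upper(): string { return strtoupper($this->text); }
    public function __toString(): string { return $this->text; }
}

$label = new Label('k');

// As an array key the object is reduced to its string form.
$plain = [(string) $label => 'value'];
foreach ($plain as $key => $v) {
    $keyIsObject = is_object($key);            // false
    $rewrapped   = (new Label($key))->upper(); // a new object must be built
}

// SplObjectStorage keeps the object itself, so identity survives.
$storage = new SplObjectStorage();
$storage[$label] = 'value';
foreach ($storage as $key) {
    $sameObject = ($key === $label);           // true
}

var_dump($keyIsObject, $rewrapped, $sameObject);
```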

johannes





Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Andrea Faulds

 On 23 Oct 2014, at 14:53, Johannes Schlüter johan...@schlueters.de wrote:
 
 On Thu, 2014-10-23 at 11:38 +0100, Joe Watkins wrote:
 It doesn't solve the problem directly but allows the programmer to solve
 it for themselves, just like Object.hashCode in Java.
 
 The point is that it won't work in this way:
 
    $a = [ $ustring => $value ];
    foreach ($a as $key => $v) {
        $key->ustring_method();
    }
 
 but one needs something along the lines of
 
    $a = [ $ustring => $value ];
    foreach ($a as $key => $v) {
        UString::fromHashCode($key)->ustring_method();
    }
 
 which likely loses object identity.
 
 It works but is not really nice :-)

u($key)->split(',')->... works :)

--
Andrea Faulds
http://ajf.me/



Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Johannes Schlüter
On Thu, 2014-10-23 at 14:59 +0100, Andrea Faulds wrote:
  On 23 Oct 2014, at 14:53, Johannes Schlüter johan...@schlueters.de wrote:
  
  On Thu, 2014-10-23 at 11:38 +0100, Joe Watkins wrote:
  It doesn't solve the problem directly but allows the programmer to solve
  it for themselves, just like Object.hashCode in Java.
  
  The point is that it won't work in this way:
  
     $a = [ $ustring => $value ];
     foreach ($a as $key => $v) {
         $key->ustring_method();
     }
  
  but one needs something along the lines of
  
     $a = [ $ustring => $value ];
     foreach ($a as $key => $v) {
         UString::fromHashCode($key)->ustring_method();
     }
  
  which likely loses object identity.
  
  It works but is not really nice :-)
 
 u($key)->split(',')->... works :)

That's something other than the original example, though, and it keeps this
from behaving like an integral part of the language.

The proper solution would be a unicode type, but PHP 6 showed that this
is not going to work out. This is still way better than what we have right
now and a good step in the right direction. We might integrate it into
the core language more and more over time.

My point is to stress that this is incomplete, as Dmitry said, and that
we should not take this alone as the final solution forever.

johannes

P.S. u() is a bad name and will break lots of code, e.g.
https://code.openhub.net/file?fid=wRj6MYm-GPDxPidisWYoLa23wFccid=CCYlIMOwTkss=fndef%3Aupp=0fl=PHPff=1filterChecked=truefp=126888mp,=1ml=1me=1md=1projSelected=true#L0
will give weird runtime behavior, as its definition is guarded by a
function_exists check but the two functions do completely different things.
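
A small sketch of that failure mode, under stated assumptions: the first u() below merely stands in for a built-in that returns an object (ArrayObject is used as a placeholder), and the guarded u() is an invented example of a userland helper, not the code behind the link above.

```php
<?php
// Stand-in for a u() provided by an extension: wraps a string in an object.
// (ArrayObject is only a placeholder for "some object"; this is an assumption.)
function u(string $s): ArrayObject
{
    return new ArrayObject([$s]);
}

// Typical guarded helper in existing userland code; here it is meant to
// escape HTML. Because u() already exists, this definition is skipped silently.
if (!function_exists('u')) {
    function u($s)
    {
        return htmlspecialchars($s);
    }
}

// The caller expects an escaped string but receives an object instead.
$out = u('<b>');
$isString = is_string($out);
```

No error is raised at the point of definition; the mismatch only surfaces later, wherever the result is used as a string.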





Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Stas Malyshev
Hi!

 Not ready for discussion yet ...
 
 https://wiki.php.net/rfc/hashkey

Hey, I've just started my own... https://wiki.php.net/rfc/objkey I guess
we should combine them :)

-- 
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/




Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Stas Malyshev
Hi!

 P.S. u() is a bad name, will break lots of code, i.e.

Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe.

-- 
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/




Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Andrea Faulds

 On 23 Oct 2014, at 20:54, Stas Malyshev smalys...@sugarcrm.com wrote:
 
 P.S. u() is a bad name, will break lots of code, i.e.
 
 Maybe __u()? It's a bit ugly but you're not allowed to use __ so it's safe.

I don't like that. This might sound crazy, but what about adding Unicode string 
literals to the parser, e.g. u"foo bar\u{202e}你好"? If the UString extension 
isn't available, just error. It wouldn't be the first time we had disableable 
syntax features (``), and this avoids any possible conflicts.
--
Andrea Faulds
http://ajf.me/



Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Joe Watkins
On Thu, 2014-10-23 at 12:47 -0700, Stas Malyshev wrote:
 Hi!
 
  Not ready for discussion yet ...
  
  https://wiki.php.net/rfc/hashkey
 
 Hey, I've just started my own... https://wiki.php.net/rfc/objkey I guess
 we should combine them :)
 

Happy to port the patch I have already written to conform to your
specification (it more or less complies, other than the name). You are
welcome to go ahead and do the RFC bit?

Cheers
Joe





Re: [PHP-DEV] [RFC] UString

2014-10-23 Thread Joe Watkins
On Thu, 2014-10-23 at 12:47 -0700, Stas Malyshev wrote:
 Hi!
 
  Not ready for discussion yet ...
  
  https://wiki.php.net/rfc/hashkey
 
 Hey, I've just started my own... https://wiki.php.net/rfc/objkey I guess
 we should combine them :)
 

Done, branch @ http://github.com/krakjoe/php-src/compare/hashkey

Cheers
Joe





Re: [PHP-DEV] [RFC] UString

2014-10-22 Thread Rowan Collins
On 21 October 2014 23:21:37 GMT+01:00, Andrea Faulds a...@ajf.me wrote:

 On 21 Oct 2014, at 21:42, Rowan Collins rowan.coll...@gmail.com
wrote:
 
 The only case I can see where a default encoding would be sensible
would be where source code itself is in a different encoding, so that
u('literal string') works as expected.

This is only a good idea if we can somehow make it file-local.
Otherwise if one library uses Latin-1 and another uses UTF-8 for some
reason, bang!

Yes, I used the word "declared" advisedly, because I was thinking it could take 
its default encoding (if we were to go down the route of special literal syntax 
rather than wrapper-function) from the existing declare(encoding='...') 
directive, rather than a global variable or setting.

http://php.net/manual/en/control-structures.declare.php#control-structures.declare.encoding





Re: [PHP-DEV] [RFC] UString

2014-10-22 Thread Rowan Collins


On 21 October 2014 23:21:37 GMT+01:00, Andrea Faulds a...@ajf.me wrote:

Make array-like indexing with [] be by
code points as you may be able to do that in constant time

If the internal representation is UTF8, both code point and grapheme access 
require traversal unless you have some additional index structure. Both can be 
trivialised to byte access if you have detected and stored that the string is 
entirely ASCII, but otherwise you will nearly always have multiple widths 
within one string.

If the internal representation is UTF16, code point access can be accelerated 
for any string containing only BMP characters (no surrogate pairs). The Perl6 
concept of NFG attempts to extend that advantage to grapheme access, and to 
points outside the BMP.
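
As an illustration of the UTF-8 case, a minimal sketch in plain PHP: the helper name codePointAt() is invented for this note, and it assumes the input is already valid UTF-8. It has to walk the string from the start, reading each lead byte to learn the width of the sequence, which is why indexing is linear rather than constant time.

```php
<?php
// Returns the code point at 0-based position $index of a valid UTF-8 string,
// or null if the index is out of range. Hypothetical helper, not a PHP built-in.
function codePointAt(string $utf8, int $index): ?string
{
    if ($index < 0) {
        return null;
    }
    $length = strlen($utf8);
    $seen = 0;
    for ($i = 0; $i < $length;) {
        $byte = ord($utf8[$i]);
        // The lead byte tells us how many bytes this code point occupies.
        if ($byte < 0x80) {
            $width = 1;
        } elseif ($byte < 0xE0) {
            $width = 2;
        } elseif ($byte < 0xF0) {
            $width = 3;
        } else {
            $width = 4;
        }
        if ($seen === $index) {
            return substr($utf8, $i, $width);
        }
        $i += $width;
        $seen++;
    }
    return null;
}

$s = "a\u{00E9}\u{4F60}b"; // 1-, 2-, 3- and 1-byte sequences
$second = codePointAt($s, 1);
$third = codePointAt($s, 2);
```

If a string is known to be pure ASCII, the loop can be skipped and `$utf8[$index]` used directly, which is the byte-access shortcut mentioned above.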





Re: [PHP-DEV] [RFC] UString

2014-10-22 Thread Pierre Joye
hi,

On Tue, Oct 21, 2014 at 4:01 PM, Dmitry Stogov dmi...@zend.com wrote:
 Hi Joe,

 As an extension it looks fine.
 I assume, you don't propose to use UString objects in engine and other
 extensions.
 Unfortunately, it's yet another incomplete solution.

I have to agree here.

As much as I like what has been done here, having UString as part of
the engine or at least main/ may help tighter integration. I am also
not sure about the driver approach (I have to double-check it again, as I
stopped following it a couple of weeks ago). Having UString in the
core is a great thing anyway. However there is no mention whether it
should be always enabled or not. I think it should be always enabled,
providing the base Unicode strings features by default. Having ICU as
default dependency is not really an issue imho.

We discussed that with Joe in the early UString days but we did not
agree. Mainly because he likes to keep UString independent, unbloated
etc. I think it is possible to keep it simple and have it tightly
integrated in the core. Advanced features can be done either in intl
or in userland (if we can avoid having every single project doing its
own Unicode string class... that would keep the performance
improvement and avoid other annoying API differences).

 It won't allow Unicode strings as array keys;
 concatenation using . (probably may be done),
 no auto-conversion from/to script/output encoding,
 no auto-conversion of strings coming from database extensions, etc

 The right approach, would be extending zend_string with encoding and
 then adopting near all functions working with zend_string to take
 encoding into account. But, of course, this is going to lead to much more
 complicated solution (with some slowdown).

Fully agree here too.

 If we don't care about complete solution, UString proposal may make sense
 at least as a faster replacement of ext/mbstring.

I agree here too. For one I do care about a complete solution, for the
basic Unicode features, integrated with the language.

 Thanks. Dmitry.



 On Tue, Oct 21, 2014 at 11:06 AM, Joe Watkins pthre...@pthreads.org wrote:

 Morning internalz,

 https://wiki.php.net/rfc/ustring

 This is the result of work done by a few of us, we won't be
 opening any
 vote in a fortnight. We have a long time before 7, there is no rush
 whatever.

 Now seems like a good time to start the conversation so we can
 hash out
 the details, or get on with other things ;)

 Cheers
 Joe


 --
 PHP Internals - PHP Runtime Development Mailing List
 To unsubscribe, visit: http://www.php.net/unsub.php





-- 
Pierre

@pierrejoye | http://www.libgd.org




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Leigh
On 21 October 2014 08:06, Joe Watkins pthre...@pthreads.org wrote:
 Morning internalz,

 https://wiki.php.net/rfc/ustring

 This is the result of work done by a few of us, we won't be opening 
 any
 vote in a fortnight. We have a long time before 7, there is no rush
 whatever.

 Now seems like a good time to start the conversation so we can hash 
 out
 the details, or get on with other things ;)


Breaks nothing, faster than mbstring, seems like win/win to me.

 On the flip side, implementing UString as a scalar object would be 
 inconsistent. At time of writing, array, int, float, bool, etc have no 
 implementation available for this.

I agree it shouldn't be a scalar object, but how about some operator
overloading like the GMP object has, so that you don't have to cast to
string for expected behaviour with type coercion etc.

 Right now there are user-space libraries out there that cover a lot more 
 functionality than UString.

Do you need help implementing these? Do you think it would be
beneficial to briefly list which areas need attention on the RFC, so
they can be checked off over time?

Overall +1 on the concept.




RE: [PHP-DEV] [RFC] UString

2014-10-21 Thread Zeev Suraski
 -Original Message-
 From: Joe Watkins [mailto:pthre...@pthreads.org]
 Sent: Tuesday, October 21, 2014 10:07 AM
 To: internals@lists.php.net
 Subject: [PHP-DEV] [RFC] UString

 Morning internalz,

   https://wiki.php.net/rfc/ustring

   This is the result of work done by a few of us, we won't be opening
 any vote in a fortnight. We have a long time before 7, there is no rush
 whatever.

   Now seems like a good time to start the conversation so we can
 hash out the details, or get on with other things ;)

+1 from me.  I think it's the right way to tackle Unicode.

Zeev




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Joe Watkins
On Tue, 2014-10-21 at 08:40 +0100, Leigh wrote:
 On 21 October 2014 08:06, Joe Watkins pthre...@pthreads.org wrote:
  Morning internalz,
 
  https://wiki.php.net/rfc/ustring
 
  This is the result of work done by a few of us, we won't be opening 
  any
  vote in a fortnight. We have a long time before 7, there is no rush
  whatever.
 
  Now seems like a good time to start the conversation so we can hash 
  out
  the details, or get on with other things ;)
 
 
 Breaks nothing, faster than mbstring, seems like win/win to me.
 
  On the flip side, implementing UString as a scalar object would be 
  inconsistent. At time of writing, array, int, float, bool, etc have no 
  implementation available for this.
 
 I agree it shouldn't be a scalar object, but how about some operator
 overloading like the GMP object has, so that you don't have to cast to
 string for expected behaviour with type coercion etc.
 
  Right now there are user-space libraries out there that cover a lot more 
  functionality than UString.
 
 Do you need help implementing these? Do you think it would be
 beneficial to briefly list which areas need attention on the RFC, so
 they can be checked off over time?
 
 Overall +1 on the concept.

Morning Leigh,

ZEND_CONCAT is overloaded, as well as read_dimension and cast (to
string) handlers. This seems to cover everything, unless I missed
something ?
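
For readers who want a feel for what those three handlers give you, here is a rough userland analogue. It is only a sketch: CodePointText is an invented class, it uses ArrayAccess and __toString rather than engine-level handlers, it targets PHP 8.0+ (for the mixed type), and it says nothing about how the extension itself is implemented.

```php
<?php
// Userland analogue of "read_dimension + cast + concat"; NOT the UString class.
final class CodePointText implements ArrayAccess
{
    /** @var string[] one element per code point */
    private array $points;

    public function __construct(string $utf8)
    {
        // Split on the empty pattern in UTF-8 mode: one array entry per code point.
        $this->points = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY) ?: [];
    }

    // Used by (string) casts and by the . operator.
    public function __toString(): string
    {
        return implode('', $this->points);
    }

    public function offsetExists(mixed $offset): bool
    {
        return isset($this->points[$offset]);
    }

    // $text[1] returns the second code point, not the second byte.
    public function offsetGet(mixed $offset): mixed
    {
        return $this->points[$offset] ?? null;
    }

    public function offsetSet(mixed $offset, mixed $value): void
    {
        throw new LogicException('CodePointText is read-only');
    }

    public function offsetUnset(mixed $offset): void
    {
        throw new LogicException('CodePointText is read-only');
    }
}

$text = new CodePointText("h\u{00E9}llo");
$second = $text[1];      // code point access via offsetGet()
$joined = $text . '!';   // concatenation via __toString()
```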

Cheers
Joe





Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Lester Caine
On 21/10/14 08:06, Joe Watkins wrote:
 Now seems like a good time to start the conversation so we can hash out
 the details, or get on with other things ;)

Does this address the problem of sorting array keys using a particular
language or collation?

-- 
Lester Caine - G8HFL
-
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Joe Watkins
On Tue, 2014-10-21 at 09:02 +0100, Lester Caine wrote:
 On 21/10/14 08:06, Joe Watkins wrote:
  Now seems like a good time to start the conversation so we can hash out
  the details, or get on with other things ;)
 
 Does this address the problem of sorting array keys using a particular
 language or collation?
 
 -- 
 Lester Caine - G8HFL
 -
 Contact - http://lsces.co.uk/wiki/?page=contact
 L.S.Caine Electronic Services - http://lsces.co.uk
 EnquirySolve - http://enquirysolve.com/
 Model Engineers Digital Workshop - http://medw.co.uk
 Rainbow Digital Media - http://rainbowdigitalmedia.co.uk
 

No.

Cheers
Joe





Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Nicolas Grekas
This is great, thanks for the work!
I think we should have an opinion on grapheme clusters and state it in
the RFC.

I do support the idea that PHP users need to handle characters in terms of
graphemes. We need a core way to deal with code points of course, but
things like reverse have very low value without graphemes.
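
A short sketch of why reverse is the usual example, using only core PCRE functions. The two helper names are invented for this note; the grapheme version relies on the PCRE \X escape, which matches an extended grapheme cluster.

```php
<?php
// Reverse by code point: combining marks get detached from their base letter.
function reverseCodePoints(string $utf8): string
{
    $points = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    return implode('', array_reverse($points));
}

// Reverse by grapheme cluster: \X keeps base letter and combining marks together.
function reverseGraphemes(string $utf8): string
{
    preg_match_all('/\X/u', $utf8, $matches);
    return implode('', array_reverse($matches[0]));
}

// "a" followed by "e" + U+0301 COMBINING ACUTE ACCENT (renders as "aé").
$word = "ae\u{0301}";

$byCodePoint = reverseCodePoints($word); // accent now precedes "e"
$byGrapheme  = reverseGraphemes($word);  // "é" stays intact, then "a"
```

In the code point version the accent ends up at the start of the string with nothing to attach to, which is rarely what a user means by "reverse".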

toLower/toUpper also miss the Turkish specifics - or is the UString class
locale dependent?
Should we add toCaseFold? Where are the case-insensitive ("i") versions of
strpos, etc.? Do we want them in core PHP7? Another point we should add to
the RFC.

For reference here is my grapheme cluster aware string handling:
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/class/Patchwork/Utf8.php

and the same, but as a Turkish variant:
https://github.com/nicolas-grekas/Patchwork-UTF8/blob/master/class/Patchwork/TurkishUtf8.php

About unicode equivalence:
For all the string matching functions (contains, startsWith, etc.), do they
handle Unicode equivalence?
How do we compare two UStrings? Does the == operator handle Unicode
equivalence? What is the way to go otherwise? Normalize it beforehand on our
own?
The RFC should also cover this, IMHO (and state that collation/sorting
handling is out of scope).
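
To illustrate the equivalence question with plain PHP strings: the sketch below compares a precomposed and a decomposed spelling of the same character. canonicallyEqual() is a helper invented for this note; it uses the intl Normalizer class when that extension is loaded and returns null otherwise. How UString's own == behaves is exactly the open question and is not shown here.

```php
<?php
// Compares two UTF-8 strings after NFC normalization.
// Returns null when ext/intl (and therefore Normalizer) is not available.
function canonicallyEqual(string $a, string $b): ?bool
{
    if (!class_exists('Normalizer')) {
        return null;
    }
    return Normalizer::normalize($a, Normalizer::FORM_C)
        === Normalizer::normalize($b, Normalizer::FORM_C);
}

$precomposed = "\u{00E9}";   // U+00E9 LATIN SMALL LETTER E WITH ACUTE
$decomposed  = "e\u{0301}";  // "e" + U+0301 COMBINING ACUTE ACCENT

// Byte-wise comparison sees two different strings.
$bytesEqual = ($precomposed === $decomposed);

// Canonical comparison treats them as the same text (when intl is present).
$canonical = canonicallyEqual($precomposed, $decomposed);
```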

Complex topic :)

Cheers,
Nicolas


Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Dmitry Stogov
Hi Joe,

As an extension it looks fine.
I assume, you don't propose to use UString objects in engine and other
extensions.
Unfortunately, it's yet another incomplete solution.

It won't allow Unicode strings as array keys;
concatenation using . (probably may be done),
no auto-conversion from/to script/output encoding,
no auto-conversion of strings coming from database extensions, etc

The right approach would be extending zend_string with encoding and
then adapting nearly all functions working with zend_string to take
encoding into account. But, of course, this is going to lead to a much more
complicated solution (with some slowdown).

If we don't care about a complete solution, the UString proposal may make sense
at least as a faster replacement of ext/mbstring.

Thanks. Dmitry.



On Tue, Oct 21, 2014 at 11:06 AM, Joe Watkins pthre...@pthreads.org wrote:

 Morning internalz,

 https://wiki.php.net/rfc/ustring

 This is the result of work done by a few of us, we won't be
 opening any
 vote in a fortnight. We have a long time before 7, there is no rush
 whatever.

 Now seems like a good time to start the conversation so we can
 hash out
 the details, or get on with other things ;)

 Cheers
 Joe


 --
 PHP Internals - PHP Runtime Development Mailing List
 To unsubscribe, visit: http://www.php.net/unsub.php




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Leigh
On 21 October 2014 09:01, Joe Watkins pthre...@pthreads.org wrote:

 ZEND_CONCAT is overloaded, as well as read_dimension and cast (to
 string) handlers. This seems to cover everything, unless I missed
 something ?


ZEND_CONCAT and ZEND_ASSIGN_CONCAT were my primary concerns, I didn't
see any mention of these in the RFC which is why I brought it up
(maybe it should be documented there).

May not be desirable at all, but obviously with ordinary strings we
can do `int + str containing int`, and if the UString object
contains an int then `int + (string)ustring` will still achieve that.

My thought was to make the remaining operators that don't make sense
on an object implicitly cast to string before the operation takes
place.
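
A minimal sketch of the current situation with a plain userland object. NumericText is an invented class, not UString: concatenation goes through __toString() automatically, while arithmetic needs the explicit cast to string first.

```php
<?php
// Hypothetical string-like object; stands in for any class with __toString().
final class NumericText
{
    private string $value;

    public function __construct(string $value)
    {
        $this->value = $value;
    }

    public function __toString(): string
    {
        return $this->value;
    }
}

$text = new NumericText('37');

// Explicit cast: the numeric string "37" is coerced to an integer by +.
$sum = 5 + (string) $text;

// Concatenation calls __toString() implicitly, no cast needed.
$joined = 'n=' . $text;
```

As far as I know, `5 + $text` without the cast does not go through __toString() at all, and what happens instead depends on the PHP version, which is the gap the implicit-cast idea above would close.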

Feel free to do not want. :)




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Philip Hofstetter
Hello

Tangentially related:

On Tuesday, October 21, 2014, Dmitry Stogov dmi...@zend.com wrote:


 It won't allow Unicode strings as array keys;


I wish there was a way for specific objects to opt into this.

Using __toString()  we have something that mostly behaves just like a
string and can be used wherever a string is required - with the exception
of array keys.

I seem to remember some earlier discussion that led to this being
intentionally made impossible (and I understand why), but maybe there could
be support for another magic underscore method that's called when an object
is about to be put into an array as a key (or similar situations)

Philip


-- 
Sensational AG
Giesshübelstrasse 62c, Postfach 1966, 8021 Zürich
Tel. +41 43 544 09 60, Mobile  +41 79 341 01 99
i...@sensational.ch, http://www.sensational.ch


Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Florian Margaine
Hi,

@Philip: please read the discussion that happened a month ago (and follow
up on it if necessary):
http://marc.info/?l=php-internals&m=141145952422734&w=2

Regards,

On Tue, Oct 21, 2014 at 11:19 AM, Philip Hofstetter 
phofstet...@sensational.ch wrote:

 Hello

 Tangentially related:

 On Tuesday, October 21, 2014, Dmitry Stogov dmi...@zend.com wrote:
 
 
  It won't allow Unicode strings as array keys;


 I wish there was a way for specific objects to opt into this.

 Using __toString()  we have something that mostly behaves just like a
 string and can be used wherever a string is required - with the exception
 of array keys.

 I seem to remember some earlier discussion that led to this being
 intentionally made impossible (and I understand why), but maybe there could
 be support for another magic underscore method that's called when an object
 is about to be put into an array as a key (or similar situations)

 Philip


 --
 Sensational AG
 Giesshübelstrasse 62c, Postfach 1966, 8021 Zürich
 Tel. +41 43 544 09 60, Mobile  +41 79 341 01 99
 i...@sensational.ch, http://www.sensational.ch




-- 
Florian Margaine


Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Joe Watkins
On Tue, 2014-10-21 at 13:01 +0400, Dmitry Stogov wrote:
 Hi Joe,
 
 
 As an extension it looks fine.
 
 I assume, you don't propose to use UString objects in engine and other
 extensions.

I'm not proposing it now, no.

 Unfortunately, it's yet another incomplete solution.
 
 It won't allow Unicode strings as array keys;

The engine doesn't allow that, couldn't we find a way of using objects
as array keys ?? It doesn't seem like a limitation of the extension, to
me ;)

 concatenation using . (probably may be done),

That's already done.

 no auto-conversion from/to script/output encoding,

That could be arranged.

 no auto-conversion of strings coming from database extensions, etc

I'm not sure how important that is, it's not a big deal to create a new
object, nor would it be a big deal for those extensions that need to
always return unicode strings to do so.
 
 The right approach, would be extending zend_string with encoding
 and then adopting near all functions working with zend_string to take
 encoding into account. But, of course, this is going to lead to much
 more complicated solution (with some slowdown).

That seems a lot like bashing our head against a wall. We tried to
introduce support everywhere and it failed. Do we really want to step on
the performance gains introduced by recent changes by making all strings
unicode ?

That doesn't seem like a sensible thing to want, at least right now.

Having UString doesn't stop us approaching the problem differently in
the future, but it would have to be a very different future to even make
sense to me.

 If we don't care about complete solution, UString proposal may make
 sense at least as a faster replacement of ext/mbstring.

As the RFC states, we are only approaching one problem, the problem that
ext/mbstring is not a good API.
 
 Thanks. Dmitry.


 On Tue, Oct 21, 2014 at 11:06 AM, Joe Watkins pthre...@pthreads.org
 wrote:
 Morning internalz,
 
 https://wiki.php.net/rfc/ustring
 
 This is the result of work done by a few of us, we
 won't be opening any
 vote in a fortnight. We have a long time before 7, there is no
 rush
 whatever.
 
 Now seems like a good time to start the conversation
 so we can hash out
 the details, or get on with other things ;)
 
 Cheers
 Joe
 
 
 --
 PHP Internals - PHP Runtime Development Mailing List
 To unsubscribe, visit: http://www.php.net/unsub.php

Cheers
Joe





Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Dmitry Stogov
On Tue, Oct 21, 2014 at 1:25 PM, Joe Watkins pthre...@pthreads.org wrote:

 On Tue, 2014-10-21 at 13:01 +0400, Dmitry Stogov wrote:
  Hi Joe,
 
 
  As an extension it looks fine.
 
  I assume, you don't propose to use UString objects in engine and other
  extensions.

 I'm not proposing it now, no.

  Unfortunately, it's yet another incomplete solution.
 
  It won't allow Unicode strings as array keys;

 The engine doesn't allow that, couldn't we find a way of using objects
 as array keys ?? It doesn't seem like a limitation of the extension, to
 me ;)

  concatenation using . (probably may be done),

 That's already done.

  no auto-conversion from/to script/output encoding,

 That could be arranged.

  no auto-conversion of strings coming from database extensions, etc

 I'm not sure how important that is, it's not a big deal to create a new
 object, nor would it be a big deal for those extensions that need to
 always return unicode strings to do so.
 
  The right approach, would be extending zend_string with encoding
  and then adopting near all functions working with zend_string to take
  encoding into account. But, of course, this is going to lead to much
  more complicated solution (with some slowdown).

 That seems a lot like bashing our head against a wall. We tried to
 introduce support everywhere and it fails. Do we really want to step on
 the performance gains introduced by recent changes by making all strings
 unicode ?


Yeah :)
I'm not sure whether it should be done, and I wouldn't like to work on it in
the near future, but the zend_string approach should be easier to implement
than the separate IS_UNICODE + IS_STRING + IS_BINARY types in PHP6.


 That doesn't seem like a sensible thing to want, at least right now.

 Having UString doesn't stop us approaching the problem differently in
 the future, but it would have to be a very different future to even make
 sense to me.


Agree.



  If we don't care about complete solution, UString proposal may make
  sense at least as a faster replacement of ext/mbstring.

 As the RFC states, we are only approaching one problem, the problem that
 ext/mbstring is not a good API.


Then, it's fine.

One note regarding implementation: why do you use C++ for ustring.cpp? I
understand it's necessary for the ICU backend, but if in the future you might
switch to another backend (which may not require C++), why use C++ for the
PHP extension part?

Thanks. Dmitry.


 
  Thanks. Dmitry.


  On Tue, Oct 21, 2014 at 11:06 AM, Joe Watkins pthre...@pthreads.org
  wrote:
  Morning internalz,
 
  https://wiki.php.net/rfc/ustring
 
  This is the result of work done by a few of us, we
  won't be opening any
  vote in a fortnight. We have a long time before 7, there is no
  rush
  whatever.
 
  Now seems like a good time to start the conversation
  so we can hash out
  the details, or get on with other things ;)
 
  Cheers
  Joe
 
 
  --
  PHP Internals - PHP Runtime Development Mailing List
  To unsubscribe, visit: http://www.php.net/unsub.php

 Cheers
 Joe




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Joe Watkins
On Tue, 2014-10-21 at 13:52 +0400, Dmitry Stogov wrote:
 
 
 On Tue, Oct 21, 2014 at 1:25 PM, Joe Watkins pthre...@pthreads.org
 wrote:
 On Tue, 2014-10-21 at 13:01 +0400, Dmitry Stogov wrote:
  Hi Joe,
 
 
  As an extension it looks fine.
 
  I assume, you don't propose to use UString objects in engine
 and other
  extensions.
 
 I'm not proposing it now, no.
 
  Unfortunately, it's yet another incomplete solution.
 
  It won't allow Unicode strings as array keys;
 
 The engine doesn't allow that, couldn't we find a way of using
 objects
 as array keys ?? It doesn't seem like a limitation of the
 extension, to
 me ;)
 
  concatenation using . (probably may be done),
 
 That's already done.
 
  no auto-conversion from/to script/output encoding,
 
 That could be arranged.
 
  no auto-conversion of strings coming from database
 extensions, etc
 
 I'm not sure how important that is, it's not a big deal to
 create a new
 object, nor would it be a big deal for those extensions that
 need to
 always return unicode strings to do so.
 
  The right approach, would be extending zend_string with
 encoding
  and then adopting near all functions working with
 zend_string to take
  encoding into account. But, of course, this is going to
 lead to much
  more complicated solution (with some slowdown).
 
 That seems a lot like bashing our head against a wall. We
 tried to
 introduce support everywhere and it fails. Do we really want
 to step on
 the performance gains introduced by recent changes by making
 all strings
 unicode ?
 
 
 Yeah :)

You must like punishment :D
 
 I'm not sure, if it should be done, and I don't like to work on it in
 the nearest future, but zend_string approach should be easier to
 implement than separate IS_UNICODE + IS_STRING + IS_BINARY types in
 PHP6.
 
The implementation might be simpler, but the effect is the same, I think.

I could be wrong, but nothing has changed so drastically that it will allow
us to absorb the kind of impact I think you are talking about.

  
 
 That doesn't seem like a sensible thing to want, at least
 right now.
 
 Having UString doesn't stop us approaching the problem
 differently in
 the future, but it would have to be a very different future to
 even make
 sense to me.
 
 
 Agree.
  
 
 
  If we don't care about complete solution, UString proposal
 may make
  sense at least as a faster replacement of ext/mbstring.
 
 As the RFC states, we are only approaching one problem, the
 problem that
 ext/mbstring is not a good API.
 
 
 Then, it's fine.
 
 One note regarding implementation: why do you use C++ for ustring.cpp?
 I understand it's necessary for ICU backend, but if in the future you
 might switch to another backend (and it may not require C++) why to
 use C++ for PHP extension part? 

Totally possible that we'll have to change, or that we should change. A
few people have said they would like to write a backend so we'll see
what comes in and where that leads us.


 
 Thanks. Dmitry.
  
 
 
  Thanks. Dmitry.
 
 
  On Tue, Oct 21, 2014 at 11:06 AM, Joe Watkins
 pthre...@pthreads.org
  wrote:
  Morning internalz,
 
  https://wiki.php.net/rfc/ustring
 
  This is the result of work done by a few of
 us, we
  won't be opening any
  vote in a fortnight. We have a long time before 7,
 there is no
  rush
  whatever.
 
  Now seems like a good time to start the
 conversation
  so we can hash out
  the details, or get on with other things ;)
 
  Cheers
  Joe
 
 
  --
  PHP Internals - PHP Runtime Development Mailing List
  To unsubscribe, visit: http://www.php.net/unsub.php
 
 Cheers
 Joe


Cheers
Joe





Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Lester Caine
On 21/10/14 10:52, Dmitry Stogov wrote:
 That seems a lot like bashing our head against a wall. We tried to
  introduce support everywhere and it fails. Do we really want to step on
  the performance gains introduced by recent changes by making all strings
  unicode ?
 
 Yeah :)
 I'm not sure, if it should be done, and I don't like to work on it in the
 nearest future, but zend_string approach should be easier to implement than
 separate IS_UNICODE + IS_STRING + IS_BINARY types in PHP6.

Isn't this the first discussion?

If we are going down the route of keeping PHP7 as ASCII-only in the core,
then ustring probably makes sense, but it does not address many of the
areas where Unicode is really needed. Handling Unicode content outside
the core is working reasonably at the moment; it is problems such as
using Unicode keys for arrays that are the main area where Unicode is
needed in PHP7, and so a more embedded handling is needed, which may cut
across yet another content wrapper?

-- 
Lester Caine - G8HFL
-
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Matteo Beccati
On 21/10/2014 09:06, Joe Watkins wrote:
 Morning internalz,
 
   https://wiki.php.net/rfc/ustring
 
   This is the result of work done by a few of us, we won't be opening any
 vote in a fortnight. We have a long time before 7, there is no rush
 whatever.
 
   Now seems like a good time to start the conversation so we can hash out
 the details, or get on with other things ;)

Nice job!

However, doesn't ICU use UTF-16 by default, which is undesirable as most
of the time it requires converting from and to UTF-8?


Cheers
-- 
Matteo Beccati

Development  Consulting - http://www.beccati.com/




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Rowan Collins

Lester Caine wrote (on 21/10/2014):

If we are going down the route of keeping PHP7 as ascii only in the core,
then ustring probably makes sense, but it does not address many of the
areas where unicode is really needed.


Just a quick point: most of the core is not ASCII. PHP strings are byte 
strings, completely divorced from any encoding. A few native functions 
assume ISO-8859-1 (or possibly Windows CP1252), but mostly they just 
juggle whichever bytes you give them.
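
A quick sketch of what "byte strings" means in practice, using only core functions. The variable names are just for illustration; the string is written with \u{...} escapes so the example does not depend on the encoding of the source file.

```php
<?php
// "naïve" as UTF-8: the "ï" (U+00EF) occupies two bytes.
$text = "na\u{00EF}ve";

// strlen() counts bytes, not characters.
$byteLength = strlen($text);

// Counting matches of "." in UTF-8 mode counts code points instead.
$codePoints = preg_match_all('/./su', $text);

// strrev() reverses bytes, which splits the two-byte sequence apart.
$reversedBytes = strrev($text);

// An empty pattern in UTF-8 mode fails on invalid UTF-8, so this acts as a validity check.
$stillValid = (bool) preg_match('//u', $reversedBytes);
```

The byte-oriented functions neither know nor care that the input was UTF-8; they produce output that is no longer valid UTF-8 without raising any error.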


The main exception I can think of is that numbers are often handled 
specially, with digits and separators as defined by ASCII. But since 
we're talking UTF-8, that doesn't need to change.



Handling unicode content outside
the core is working reasonably at the moment, it is the problems such as
using unicode keys for arrays which is the main area where unicode is
needed in PHP7 and so a more embedded handling is needed which may cut
across yet another content wrapper?


I do think this is an important thing to consider, though. If this 
extension is genuinely just meant as a more modern and more performant 
way of doing things which mbstring and intl can already do, that needs 
to be clear in the way it's documented and publicised. If this gets 
publicised as better Unicode support, users are naturally going to 
expect UString objects to start appearing in core, and in other 
extensions, and be disappointed that it's still just a toolbox for their 
own string handling.


--
Rowan Collins
[IMSoP]




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Lester Caine
On 21/10/14 12:11, Rowan Collins wrote:
 Lester Caine wrote (on 21/10/2014):
 If we are going down the route of keeping PHP7 as ASCII only in the core,
 then ustring probably makes sense, but it does not address many of the
 areas where unicode is really needed.
 
 Just a quick point: most of the core is not ASCII. PHP strings are byte
 strings, completely divorced from any encoding. A few native functions
 assume ISO8859-1 (or possibly Windows CP1252), but mostly they just
 juggle whichever bytes you give them.
 
 The main exception I can think of is that numbers are often handled
 specially, with digits and separators as defined by ASCII. But since
 we're talking UTF-8, that doesn't need to change.

Pierre had proposed restricting that to ascii as a way of addressing the
inconsistencies that arise because some areas do not currently make a
distinction.

 Handling unicode content outside
 the core is working reasonably at the moment, it is the problems such as
 using unicode keys for arrays which is the main area where unicode is
 needed in PHP7 and so a more embedded handling is needed which may cut
 across yet another content wrapper?
 
 I do think this is an important thing to consider, though. If this
 extension is genuinely just meant as a more modern and more performant
 way of doing things which mbstring and intl can already do, that needs
 to be clear in the way it's documented and publicised. If this gets
 publicised as better Unicode support, users are naturally going to
 expect UString objects to start appearing in core, and in other
 extensions, and be disappointed that it's still just a toolbox for their
 own string handling.

This is where a proper discussion on just what is trying to be achieved
is important, before discussing tangents?

-- 
Lester Caine - G8HFL
-
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk
Rainbow Digital Media - http://rainbowdigitalmedia.co.uk




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Christian Schneider
Am 21.10.2014 um 09:06 schrieb Joe Watkins pthre...@pthreads.org:
   https://wiki.php.net/rfc/ustring
 
   This is the result of work done by a few of us, we won't be opening any
 vote in a fortnight. We have a long time before 7, there is no rush
 whatever.


I have one concern I want to bring up: The RFC proposes a helper function u() 
to generate UStrings.

As this is a very handy function name for all sorts of utility functions (as a 
matter of fact we use it to create and sanitize URL strings to be embedded into 
HTML) I would assume that more than one project has a name clash there.

Maybe something like _u() could be used instead? Or do you have better 
alternatives for this?

PS: UString is also in the global name space but should be less of a problem 
I'd imagine.

Regards,
- Chris





Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Michael Wallner
On 21 October 2014 14:35, Christian Schneider cschn...@cschneid.com wrote:

 Am 21.10.2014 um 09:06 schrieb Joe Watkins pthre...@pthreads.org:
https://wiki.php.net/rfc/ustring
 
This is the result of work done by a few of us, we won't be
 opening any
  vote in a fortnight. We have a long time before 7, there is no rush
  whatever.


 I have one concern I want to bring up: The RFC proposes a helper function
 u() to generate UStrings.

 As this is a very handy function name for all sorts of utility functions
 (as a matter of fact we use it to create and sanitize URL strings to be
 embedded into HTML) I would assume that more than one project has a name
 clash there.

 Maybe something like _u() could be used instead? Or do you have better
 alternatives for this?

 PS: UString is also in the global name space but should be less of a
 problem I'd imagine.


With the use function support, that could be located in a namespace.

But something else: wasn't there a big concern in another thread regarding
codepoint/grapheme support, like with $ustring->length()?


-- 
Regards,
Mike


Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Andrea Faulds

 On 21 Oct 2014, at 13:35, Christian Schneider cschn...@cschneid.com wrote:
 
 I have one concern I want to bring up: The RFC proposes a helper function u() 
 to generate UStrings.
 
 As this is a very handy function name for all sorts of utility functions (as a 
 matter of fact we use it to create and sanitize URL strings to be embedded 
 into HTML) I would assume that more than one project has a name clash there.
 
 Maybe something like _u() could be used instead? Or do you have better 
 alternatives for this?
 
 PS: UString is also in the global name space but should be less of a problem 
 I'd imagine.

I think we should reserve some way to do Unicode strings. I’d want u"foo", but 
we’re not adding literals, so u("foo") it is.

Also, bear in mind that namespaces mean you can still have your own u() if it’s 
in your namespace (\u).
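
A minimal sketch of the namespace point, assuming a hypothetical project namespace `Acme\Html`. The project keeps its own u(), and an unqualified call inside that namespace resolves to the namespaced function first; a global helper, if the RFC added one, would be written \u().

```php
<?php
namespace Acme\Html; // hypothetical project namespace, for illustration only

// The project's own u(): escape a value for use inside a URL.
function u(string $value): string
{
    return rawurlencode($value);
}

// Unqualified call: resolves to Acme\Html\u() because a function with
// that name is defined in the current namespace.
$escaped = u('a b&c');
echo $escaped, "\n"; // a%20b%26c

// The namespaced function is registered under its fully qualified name,
// so it cannot collide with a global function called u().
var_dump(\function_exists('Acme\Html\u')); // bool(true)
```

Code that lives in the global namespace and already defines u() is the case that would still clash.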
--
Andrea Faulds
http://ajf.me/








Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Andrea Faulds
So, one thing which I think is worth bringing up is code points vs. 
characters/graphemes.

This came up in another recent thread about Unicode on internals. While 
code-point manipulation is all well and good, we also need grapheme 
manipulation functions. Could we add these? That would make the API more useful.

On that note, ->charAt ought to be ->codepointAt to avoid being misleading.
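
A minimal sketch of the difference between code points and graphemes, assuming valid UTF-8 input and PHP 7 or later. The helper names are illustrative and not part of the RFC; PCRE's \X is used here as a stand-in for proper grapheme segmentation.

```php
<?php
/** Count Unicode code points in a valid UTF-8 string. */
function count_code_points(string $utf8): int
{
    // With /u, the empty pattern matches between code points.
    return count(preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY));
}

/** Count extended grapheme clusters (roughly, user-perceived characters). */
function count_graphemes(string $utf8): int
{
    // \X matches one extended grapheme cluster.
    return preg_match_all('/\X/u', $utf8);
}

// "e" followed by U+0301 COMBINING ACUTE ACCENT is displayed as one "é".
$s = "e\u{0301}";

printf(
    "bytes: %d, code points: %d, graphemes: %d\n",
    strlen($s),
    count_code_points($s),
    count_graphemes($s)
);
// bytes: 3, code points: 2, graphemes: 1
```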
--
Andrea Faulds
http://ajf.me/








Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Matteo Beccati
On 21/10/2014 15:17, Lester Caine wrote:
 On 21/10/14 11:50, Matteo Beccati wrote:
 However, doesn't ICU use UTF-16 by default, which is undesirable as most
 of the time it requires converting from and to UTF-8?
 
 http://userguide.icu-project.org/strings/utf-8
 It is interesting that the earlier adoption of UTF-16 still prevails,
 but switching to UTF-8 is becoming the norm?

Yes, as far as I knew, using UTF-8 by default was a compile-time option
for ICU, which most of the time comes from system packages.


Cheers
-- 
Matteo Beccati

Development & Consulting - http://www.beccati.com/




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Sara Golemon

 On Oct 21, 2014, at 0:06, Joe Watkins pthre...@pthreads.org wrote:
 
 Morning internalz,
 
https://wiki.php.net/rfc/ustring
 
This is the result of work done by a few of us, we won't be opening any
 vote in a fortnight. We have a long time before 7, there is no rush
 whatever.
 

The backend abstraction seems overengineered to me.  It could also lead to 
inconsistencies in behavior if ICU and Windows implement something in subtly 
different ways.

Since we're linking ICU for the rest of the intl extension anyway, it seems to 
me like we should just focus on it as an ICU wrapper.

Also, I'd propose a minor amendment to this RFC that other intl classes be 
extended to support taking UString instances as arguments (avoiding the 
implicit conversion to UTF8). That work doesn't have to gate adoption of the 
base implementation, it'd just be useful to decide at the same time if we want 
to do so.

-Sara



Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Stas Malyshev
Hi!

   https://wiki.php.net/rfc/ustring
 
   This is the result of work done by a few of us, we won't be opening any
 vote in a fortnight. We have a long time before 7, there is no rush
 whatever.

Couple of thoughts:
- I like the idea of having a unicode string class. May be a way to
figure out the right way to do it without messing up the whole core.

- I wish there were more description of which API this class provides.
If it's planned to be direct copy of UnicodeString, some of the
operations there are not how PHP strings usually work (i.e. in-place
modification) and it's not really enough to make it useful - e.g. what
if I need to do regexps on it, for example? Or does it cover whole
mbstring API too? What about something mbstring doesn't cover, like
ucfirst or strrev?

- Do we really need different encodings, different backends and so on,
internally? Note that each backend has its own quirks, limitations and
bugs, and there's nothing worse than dealing with unpredictable set of
dependencies. The user cares what they send into the class and what
comes out, but very rarely they care what happens inside - why not just
do it one way everywhere?

-- 
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Stas Malyshev
Hi!

 I wish there was a way for specific objects to opt into this.

There will be, if __hashKey() or whatever would be the properly
bikeshedded name, becomes reality as discussed elsewhere. It shouldn't
be hard to do and it's exactly what many other languages do when trying
to use objects as keys for maps.


-- 
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Stas Malyshev
Hi!

 Just a quick point: most of the core is not ASCII. PHP strings are byte 
 strings, completely divorced from any encoding. A few native functions 
 assume ISO8859-1 (or possibly Windows CP1252), but mostly they just 
 juggle which ever bytes you give them.

True, but not all extensions and functions behave this way. Some
(especially with intl, but not only) assume it's utf-8, for example, and
for some utf-8 is a changeable default, which in practice often becomes
the used encoding since people are not aware of the need to track their
encoding and most of them do use utf-8 anyway.

 The main exception I can think of is that numbers are often handled 
 specially, with digits and separators as defined by ASCII. But since 
 we're talking UTF-8, that doesn't need to change.

More interesting case actually is, well, case conversion. We unknowingly
used locale-dependent lowercasing routines until the inevitable
encounter with the dreaded Turkish 'i'. At which point we switched to
forced ASCII. So identifiers in the engine are kind of assumed to be
ASCII, even though you can sometimes sneak non-ASCII past it and it
will work, but weirdly.
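
A minimal sketch of the "forced ASCII" idea, assuming PHP 7 or later. The function name ascii_lower() is illustrative; this is not the engine's own routine, only the same principle: map A-Z to a-z and leave every other byte alone, whatever the locale.

```php
<?php
/** Lowercase only the 26 ASCII letters; all other bytes pass through unchanged. */
function ascii_lower(string $s): string
{
    return strtr(
        $s,
        'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
        'abcdefghijklmnopqrstuvwxyz'
    );
}

// "I" always becomes "i", never the Turkish dotless "ı".
echo ascii_lower('FILE_ID'), "\n"; // file_id

// U+0130 (capital I with dot above) is not ASCII, so its UTF-8 bytes
// are left exactly as they were.
echo bin2hex(ascii_lower("\u{0130}")), "\n"; // c4b0
```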

-- 
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/




Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Rowan Collins

On 21/10/2014 08:06, Joe Watkins wrote:

Morning internalz,

https://wiki.php.net/rfc/ustring

This is the result of work done by a few of us, we won't be opening any
vote in a fortnight. We have a long time before 7, there is no rush
whatever.

Now seems like a good time to start the conversation so we can hash out
the details, or get on with other things ;)

Cheers
Joe




I think this looks like a really great start at creating something 
actually useful, rather than getting stuck at the drawing board. I like 
that the scope is quite small initially - where does the single 
responsibility of a class that represents a string end, anyway? :)


A few opinions:

1) Global / static defaults are bad.

The existence of the setDefaultCodepage method feels like an 
anti-pattern to me. It means libraries can't rely on this class working 
the same way in two different host environments, or even at two 
re-entries in the same program. Effectively, if you don't know what the 
second argument to the constructor will default to, you can't actually 
treat it as optional unless you're writing monolithic code. This is a 
common pattern in PHP, but http_build_query() would be so much more 
pleasant if I could safely call it with 1 argument instead of 3.


I think the default should be hard-coded to UTF-8, which according to 
previous discussion is always the default *output* encoding, so would 
mean this would always work: $aUString = new UString( (string)$aUString 
); Any other encoding will be dependent on, and known from, the context 
where the object is created - if grabbing data from an HTTP request, a 
header should tell them; if from a database, a connection parameter; and 
so on.
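
A minimal sketch of the round trip described above, assuming the mbstring extension is available. SketchUString is a hypothetical stand-in written for this illustration; the RFC's real UString may behave differently.

```php
<?php
// Hypothetical stand-in for the RFC's class, for illustration only.
final class SketchUString
{
    /** @var string contents, always held as UTF-8 in this sketch */
    private $utf8;

    // The default is a hard-coded constant, not a changeable global setting.
    public function __construct(string $bytes, string $encoding = 'UTF-8')
    {
        $this->utf8 = $encoding === 'UTF-8'
            ? $bytes
            : mb_convert_encoding($bytes, 'UTF-8', $encoding);
    }

    // Casting to string always yields UTF-8.
    public function __toString(): string
    {
        return $this->utf8;
    }
}

// Input whose encoding is known from context (here: ISO-8859-1 bytes).
$a = new SketchUString("caf\xE9", 'ISO-8859-1');

// Because the input default matches the output encoding, this is lossless.
$b = new SketchUString((string) $a);

var_dump((string) $a === (string) $b); // bool(true)
```

With a changeable default, the second constructor call would only be safe if nobody had altered the setting in between.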


The only case I can see where a default encoding would be sensible would 
be where source code itself is in a different encoding, so that 
u('literal string') works as expected. I guess if we ever went down the 
route of special literal syntax like u'literal string', the declared 
source encoding could be used.


Actually, the u() shortcut function appears to be missing the encoding 
parameter completely; is this deliberate?


2) Clarify relationship to a byte string

Most of the API acts like this is an abstract object representing a 
bunch of Unicode code points. As such, I'm not sure what getCodepage() 
does - a code page (or more properly encoding) is a property of a stream 
of bytes, so has no meaning in this context, surely? The internal 
implementation could use UTF-8, UTF-16, or some made-up encoding (like 
Perl6's NFG system) and the user should never need to know (other than 
to understand performance implications).


On the other hand, when you *do* want a stream of bytes, the class 
doesn't seem to have an explicit way to get one. The (currently 
undocumented) behaviour is apparently to spit out UTF-8 if cast to a 
string, but it would be nice to have an explicit function which could be 
passed a parameter in order to serialise to, say, UTF-16, instead.
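
A minimal sketch of such an explicit serialisation step, assuming the mbstring extension and a value held internally as UTF-8. The function name to_bytes() is illustrative and not part of the RFC.

```php
<?php
/** Serialise UTF-8 text to a byte string in the requested encoding. */
function to_bytes(string $utf8, string $encoding = 'UTF-8'): string
{
    return $encoding === 'UTF-8'
        ? $utf8
        : mb_convert_encoding($utf8, $encoding, 'UTF-8');
}

$e = "\u{00E9}"; // "é"

echo bin2hex(to_bytes($e)), "\n";             // c3a9
echo bin2hex(to_bytes($e, 'UTF-16BE')), "\n"; // 00e9
echo bin2hex(to_bytes($e, 'UTF-16LE')), "\n"; // e900
```

The sketch names the byte order explicitly because a bare "UTF-16" leaves open which order is used and whether a byte order mark is written.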


3) The Grapheme Question

This has been raised a few times, so I won't labour the point, just 
mention my current thinking.


Unicode is complicated. Partly, that's because of a series of 
compromises in its design; but partly, it's because writing systems are 
complicated, and Unicode tries harder than most previous systems to 
acknowledge that. So, there's a tradeoff to be made between giving users 
what they think they need, thus hiding the messy details, and giving 
users the power to do things right, in a more complex way.


There is also a namespace mess if you insist on every function and 
property having to declare what level of abstraction it's talking about 
- e.g. $codePointLength instead of $length.


An idea I've been toying with is rather than having one class 
representing the slippery notion of a Unicode string, having (at 
least) two, closely tied, classes: CodePointString (roughly = UString 
right now) and GraphemeString (a higher level abstraction tied to the 
same internal representation).


I intend to mock this up as a set of interfaces at some point, but the 
basic idea is that you could write this:


// Get an abstract object from a byte string, probably a GraphemeString,
// parsing the input as UTF-8
$str = u('some text');
// Perform an operation that explicitly deals in Code Points
$str = $str->asCodePoints()->normalise('NFC');
// Get information using a higher level of abstraction
$length = $str->asGraphemes()->length;
// Perform a high-level mutation, then convert right back to a concrete
// string of bytes
echo $str->asGraphemes()->reverse()->asByteString('UTF-16');

Calling asGraphemes() on a GraphemeString or asCodePoints() on a 
CodePointString would be legal but a no-op, so it would be safe to 
accept both as input to a function, then switch to whichever level the 
task required.
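
A rough sketch of how the two views could share one representation, assuming valid UTF-8 input and PHP 7 or later. Only the class and method names come from the description above; the base class SketchText, the length() method (instead of a property) and the missing encoding parameter on asByteString() are simplifications made for this illustration.

```php
<?php
abstract class SketchText
{
    /** @var string UTF-8 bytes shared by both views */
    protected $utf8;

    public function __construct(string $utf8)
    {
        $this->utf8 = $utf8;
    }

    /** Split the text into units at this view's level of abstraction. */
    abstract protected function units(): array;

    // Switching to the view you already have is a no-op.
    public function asCodePoints(): CodePointString
    {
        return $this instanceof CodePointString
            ? $this
            : new CodePointString($this->utf8);
    }

    public function asGraphemes(): GraphemeString
    {
        return $this instanceof GraphemeString
            ? $this
            : new GraphemeString($this->utf8);
    }

    public function length(): int
    {
        return count($this->units());
    }

    // Reverses whole units, so the result depends on the current view.
    public function reverse(): self
    {
        return new static(implode('', array_reverse($this->units())));
    }

    public function asByteString(): string
    {
        return $this->utf8;
    }
}

final class CodePointString extends SketchText
{
    protected function units(): array
    {
        return preg_split('//u', $this->utf8, -1, PREG_SPLIT_NO_EMPTY);
    }
}

final class GraphemeString extends SketchText
{
    protected function units(): array
    {
        preg_match_all('/\X/u', $this->utf8, $matches);
        return $matches[0];
    }
}

// "e" + U+0301 COMBINING ACUTE ACCENT + "x"
$s = new CodePointString("e\u{0301}x");

echo $s->length(), "\n";                // 3
echo $s->asGraphemes()->length(), "\n"; // 2

// Code point reversal moves the accent onto the "x";
// grapheme reversal keeps it attached to the "e".
echo bin2hex($s->reverse()->asByteString()), "\n";                // 78cc8165
echo bin2hex($s->asGraphemes()->reverse()->asByteString()), "\n"; // 7865cc81
```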


I'm not sure if this finds a good balance between complexity and 
user-friendliness, and would welcome anyone's thoughts.


--
Rowan Collins

Re: [PHP-DEV] [RFC] UString

2014-10-21 Thread Andrea Faulds

 On 21 Oct 2014, at 21:42, Rowan Collins rowan.coll...@gmail.com wrote:
 
 The only case I can see where a default encoding would be sensible would be 
 where source code itself is in a different encoding, so that u('literal 
 string') works as expected.

This is only a good idea if we can somehow make it file-local. Otherwise if one 
library uses Latin-1 and another uses UTF-8 for some reason, bang!

 2) Clarify relationship to a byte string
 
 Most of the API acts like this is an abstract object representing a bunch of 
 Unicode code points. As such, I'm not sure what getCodepage() does - a code 
 page (or more properly encoding) is a property of a stream of bytes, so has 
 no meaning in this context, surely? The internal implementation could use 
 UTF-8, UTF-16, or some made-up encoding (like Perl6's NFG system) and the 
 user should never need to know (other than to understand performance 
 implications).
 
 On the other hand, when you *do* want a stream of bytes, the class doesn't 
 seem to have an explicit way to get one. The (currently undocumented) 
 behaviour is apparently to spit out UTF-8 if cast to a string, but it would 
 be nice to have an explicit function which could be passed a parameter in 
 order to serialise to, say, UTF-16, instead.

I agree on both these points. ->toBytes or ->encode with an explicit charset 
parameter would be good. I don’t see the point of getCodepage().

 3) The Grapheme Question
 
 This has been raised a few times, so I won't labour the point, just mention 
 my current thinking.
 
 Unicode is complicated. Partly, that's because of a series of compromises in 
 its design; but partly, it's because writing systems are complicated, and 
 Unicode tries harder than most previous systems to acknowledge that. So, 
 there's a tradeoff to be made between giving users what they think they need, 
 thus hiding the messy details, and giving users the power to do things right, 
 in a more complex way.
 
 There is also a namespace mess if you insist on every function and property 
 having to declare what level of abstraction it's talking about - e.g. 
 $codePointLength instead of $length.
 
 An idea I've been toying with is rather than having one class representing 
 the slippery notion of a Unicode string, having (at least) two, closely 
 tied, classes: CodePointString (roughly = UString right now) and 
 GraphemeString (a higher level abstraction tied to the same internal 
 representation).
 
 I intend to mock this up as a set of interfaces at some point, but the basic 
 idea is that you could write this:
 
 // Get an abstract object from a byte string, probably a GraphemeString,
 // parsing the input as UTF-8
 $str = u('some text');
 // Perform an operation that explicitly deals in Code Points
 $str = $str->asCodePoints()->normalise('NFC');
 // Get information using a higher level of abstraction
 $length = $str->asGraphemes()->length;
 // Perform a high-level mutation, then convert right back to a concrete
 // string of bytes
 echo $str->asGraphemes()->reverse()->asByteString('UTF-16');
 
 Calling asGraphemes() on a GraphemeString or asCodePoints() on a 
 CodePointString would be legal but a no-op, so it would be safe to accept 
 both as input to a function, then switch to whichever level the task required.
 
 I'm not sure if this finds a good balance between complexity and 
 user-friendliness, and would welcome anyone's thoughts.

I’d rather have some grapheme-specific functions and some code point functions 
on the same class. Make array-like indexing with [] be by code points as you 
may be able to do that in constant time, and because there might be multiple 
approaches to choosing graphemes. Have ->codepointAt(), but also 
->nthGrapheme() or something like it. There’s no need for grapheme versions of 
all functions, but others would need them.
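
A minimal sketch of the two kinds of indexing side by side, assuming valid UTF-8 input and PHP 7 or later. The function names are illustrative. Both helpers scan the whole string, so they are linear-time; constant-time code point indexing would depend on the internal representation chosen.

```php
<?php
/** Code point at a zero-based index, or null if out of range. */
function codepoint_at(string $utf8, int $index): ?string
{
    $points = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    return $points[$index] ?? null;
}

/** Extended grapheme cluster at a zero-based index, or null if out of range. */
function nth_grapheme(string $utf8, int $index): ?string
{
    preg_match_all('/\X/u', $utf8, $matches);
    return $matches[0][$index] ?? null;
}

// "née" with the é stored as "e" + U+0301 COMBINING ACUTE ACCENT
$s = "ne\u{0301}e";

echo bin2hex(codepoint_at($s, 1)), "\n"; // 65     (just the base "e")
echo bin2hex(codepoint_at($s, 2)), "\n"; // cc81   (the combining accent alone)
echo bin2hex(nth_grapheme($s, 1)), "\n"; // 65cc81 (base letter plus accent)
echo bin2hex(nth_grapheme($s, 2)), "\n"; // 65     (the final "e")
```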

Though your approach has its own merits.
--
Andrea Faulds
http://ajf.me/




