On Tue, Nov 25, 2014 at 02:41:48PM +0400, Dmitry Stogov wrote:

> I'm not completely against it. It's just an incomplete solution.
> 
> echo "\u{1F602}"; // won't output 😂 if the output encoding is not UTF-8
> 
> echo "Привет \u{1F602}"; // won't output anything useful if script
> encoding is not UTF-8
> 
> The second problem is present even for European countries that use the
> Windows-1250 codepage.

I think that we need to clarify what we are talking about.

What Andrea has proposed is a way of writing string constants. The characters
in these strings will still be 8 bits wide, which means that there needs to be
some way of encoding characters whose code points will not fit in 8 bits.
The only way of avoiding that would be to use 32 bit characters internally --
which would be a huge change.
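
To illustrate the point about 8 bit characters, here is a minimal sketch. It
assumes the mbstring extension is loaded, and it writes U+1F602 as explicit
UTF-8 bytes rather than with the proposed escape. The variable names are only
for illustration.

```php
<?php
// U+1F602 written out as its four UTF-8 bytes
// (assumption: UTF-8 is the encoding chosen for the constant).
$emoji = "\xF0\x9F\x98\x82";

// strlen() counts 8 bit units, so one code point occupies four "characters".
$byteCount = strlen($emoji);

// mb_strlen() counts code points once it is told how the bytes are encoded.
$codePointCount = mb_strlen($emoji, 'UTF-8');

echo $byteCount, " bytes, ", $codePointCount, " code point\n";
```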

So: we need to have some form of encoding.

As I stated: ''a way of writing string constants'', ie a *compile* time action.

With the code below it is likely that, at *run-time*, mb_internal_encoding() will
have been called before the echo is executed, or that the 'Content-Type:' header
will specify some encoding.

> echo "mañana \u{1F602}"; // won't output anything useful if script
> encoding is not UTF-8

This is not something that the compiler can guess.

It is even worse if my proposal of \U{arabic letter alef} types is added: how is
that encoded? UTF-8 or iso-8859-6 or ...?

So, how do we fix the problem?

* mb_internal_encoding($new_encoding) finds every string (variable and constant)
  and converts from the previous encoding to the $new_encoding.

  Possible, but horribly slow, and it would probably break things (eg strings
  that contain binary data).

  Not a good idea.

* Decide that UTF-8 is king.
  That is what I have decided - but I do not have any legacy code to worry about
  -- being a Brit I don't have to worry much.

* Rely on the programmer to understand encoding and to know what the eventual
  output encoding will be. If it is not UTF-8, write characters using \xNN hex
  escapes or use mb_convert_encoding($string, $output_encoding, 'utf-8').
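
The third option might look like the sketch below. It assumes that the string
constant is UTF-8, that mbstring is available, and that the eventual output
encoding is ISO-8859-1. $output_encoding is just an illustrative variable.

```php
<?php
// Assumption: the constant is UTF-8 ("mañana", with ñ as the bytes C3 B1).
$string = "ma\xC3\xB1ana";

// Assumption: the page will be sent as ISO-8859-1.
$output_encoding = 'ISO-8859-1';

// Convert from UTF-8 to the output encoding just before output.
$converted = mb_convert_encoding($string, $output_encoding, 'UTF-8');

// Show the resulting bytes as hex so that the change to ñ is visible.
echo bin2hex($converted), "\n";
```

The conversion has to be done by the programmer at run-time, because only the
programmer knows what the output encoding will be.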

If we decide to support non-utf-8 encoding at compile time then we could extend
the syntax a bit to allow the encoding to be specified, eg:

    \U{utf-8: arabic letter alef}

    \U{iso-8859-6: arabic letter alef}

Ie, allow the encoding to be optionally specified, terminated by ':'. If it is
not specified then assume utf-8.
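
Since the \U{...} syntax is only a proposal, the following sketch merely shows
the bytes that the two forms above would presumably compile to. It uses
mb_convert_encoding() on U+0627 (ARABIC LETTER ALEF) and assumes that mbstring
is loaded with ISO-8859-6 support.

```php
<?php
// U+0627 ARABIC LETTER ALEF as UTF-8: two bytes.
$alefUtf8 = "\xD8\xA7";

// The same character converted to ISO-8859-6: a single byte.
$alefIso = mb_convert_encoding($alefUtf8, 'ISO-8859-6', 'UTF-8');

echo bin2hex($alefUtf8), " vs ", bin2hex($alefIso), "\n";
```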


-- 
Alain Williams
Linux/GNU Consultant - Mail systems, Web sites, Networking, Programmer, IT 
Lecturer.
+44 (0) 787 668 0256  http://www.phcomp.co.uk/
Parliament Hill Computers Ltd. Registration Information: 
http://www.phcomp.co.uk/contact.php
#include <std_disclaimer.h>

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
