Andrei Zmievski wrote:
> Your point about writing portable Unicode-friendly code is well taken.
> Rasmus and I have chatted a bit here, and we think we can propose some
> changes that may make it easier.
> 
> With unicode_semantics=off:
>  * (unicode) cast converts binary strings to Unicode strings using
> runtime_encoding setting
>  * (string) converts Unicode strings to binary strings using
> runtime_encoding again

Will a program always be able to change the runtime_encoding setting?

Some hosts like to lock off everything and disable ini_set etc. If the host has
hardlocked it at something terrible, can my portable program still declare that
it needs to work with UTF-8?

Which brings to mind: if the input in $_REQUEST etc. has been misconverted by a
bad setting, how do I get at the unconverted data to fix it? The (outdated ;)
README says this will be possible, but I didn't see any reference to how.
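
To make the concern concrete, here is a minimal sketch, in Python rather than
PHP since the PHP API for this is still undecided. It assumes the browser sent
UTF-8 and that a hypothetical host setting decoded it as ISO-8859-1; the helper
name is made up for illustration. The repair only works because ISO-8859-1 maps
every byte to a code point. With a lossy conversion the original bytes would be
gone, which is why access to the unconverted data matters.

```python
def repair_misdecoded(text, wrong_encoding="iso-8859-1", real_encoding="utf-8"):
    """Undo a wrong decode: re-encode with the wrong charset, then decode properly.

    Hypothetical helper; only safe when wrong_encoding round-trips every byte.
    """
    raw = text.encode(wrong_encoding)   # recover the original bytes
    return raw.decode(real_encoding)    # decode them as what they really were


# Bytes as sent by the browser: "caf\u00e9" encoded as UTF-8 (5 bytes).
raw_input = "caf\u00e9".encode("utf-8")

# What a misconfigured runtime encoding would hand to the script.
misconverted = raw_input.decode("iso-8859-1")

# Reverse the bad conversion to get the intended text back.
repaired = repair_misdecoded(misconverted)
```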

>  * Binary and Unicode strings cannot be concatenated. You have to cast
> all operands to the same type.

I do find the FATAL ERRORS on using the 'wrong' string type a bit odd though;
most other types in PHP will coerce silently (string . int), and the wildly
incompatible ones usually cause mere NOTICE or WARNING-level messages.

Was this change from PHP's regular behavior a conscious decision to make people
think harder about what kind of strings they're using? From the original design
document I got the impression that it was meant to be specific to special
binary-only strings, which would be used relatively rarely (e.g. for binary file
I/O) while more typical strings would transparently "just work" most of the
time. Now the binary strings have replaced the native strings and the whole
behavior has changed.

(A comparison with other languages: Python is normally very strict about typing
and won't even let you concatenate a string with an integer without an explicit
conversion. But it will let you concatenate a byte string with a Unicode string,
with an automatic coercion to Unicode.)
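
A sketch of that comparison, for reference. The implicit byte/Unicode coercion
is the behaviour of the Python 2 series; Python 3 made that combination a
TypeError as well. So the runnable version below (Python 3) shows the strict
string + integer case and the explicit decode that takes the place of the
coercion. The helper name try_concat is illustrative only.

```python
def try_concat(left, right):
    """Return left + right, or the name of the exception the attempt raises."""
    try:
        return left + right
    except TypeError as exc:
        return type(exc).__name__


# A string and an integer never concatenate without an explicit conversion.
strict = try_concat("count: ", 3)
explicit = "count: " + str(3)

# Python 2 would coerce the byte string to Unicode here (assuming ASCII content);
# Python 3 refuses, so the decode has to be spelled out.
mixed = try_concat(b"abc", "def")
decoded = b"abc".decode("ascii") + "def"
```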

> With unicode_semantics=on:
>  * (unicode) cast converts binary strings to Unicode strings. The issue
> here is whether to use script_encoding (in case you do (unicode)b"blah")
> or runtime_encoding (in case it's a binary string that came from elsewhere)

Another thing you might consider is allowing only ASCII character literals in a
b"blah" binary string literal. Escape codes are available...

> I think this will make it easier to write code, because you can always
> depend on the behavior of the cast operators. The (unicode) and (string)
> casts are basically shortcuts for unicode_encode() and unicode_decode()
> used with runtime_encoding setting (excepting the issue I mentioned above).

Reliable casts would indeed be great. :)

> The unicode_semantics switch will not be per-request, due to a variety
> of reasons we have covered before.
> 
> Your suggestion about treating all string literals as Unicode if an
> encoding pragma is used is an interesting one and merits more discussion
> I think. Do you think it should affect only literals or also identifiers?

Personally I have no use for non-ASCII identifiers.

Anything that needs to get used for referring to identifiers, though, needs to
be able to operate consistently in some fashion...
* array_map("some_function_name", $data);
* $GLOBALS["myConfigVar"] = $newval;
etc

These probably need to either 'just work' when passed the other kind of string,
or have some kind of consistent cast available.
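
One shape the "consistent cast" option could take, sketched in Python since the
PHP side is undecided: normalise whichever string type arrives to a single
identifier type before the lookup. The names as_identifier and call_by_name are
hypothetical, and restricting identifiers to ASCII is an assumption of this
sketch, not part of the proposal.

```python
def as_identifier(name):
    """Accept a byte string or a Unicode string; return a Unicode identifier."""
    if isinstance(name, bytes):
        # Assumption: identifiers are ASCII-only, so decoding is unambiguous.
        name = name.decode("ascii")
    if not name.isidentifier():
        raise ValueError("not a valid identifier: %r" % (name,))
    return name


def call_by_name(namespace, name, *args):
    """Look up a callable by name in a dict, whichever string type was passed."""
    return namespace[as_identifier(name)](*args)


functions = {"double": lambda x: x * 2}

# Both string types resolve to the same function.
from_unicode = call_by_name(functions, "double", 21)
from_bytes = call_by_name(functions, b"double", 21)
```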

(Life would be a lot simpler if there weren't two different modes, of course. :)

-- brion vibber (brion @ pobox.com)
