Just to put my two yen in..

Andrei Zmievski wrote:


I haven't posted this on the internals list as yet, do you guys have
any comments/suggestions ?

[1] string http_build_query(mixed formdata [, string prefix [, string
arg_separator]])
Generates a form-encoded query string from an associative array or
object - uses urlencode() mentioned below.


[2] mixed parse_url(string url, [int url_component])
Split URL into components: username, password, hostname, port, etc.

No specification says which character encoding a URI is represented in, whereas
the IRI specification does. So we'd better have a parameter to specify the
encoding, which falls back to UTF-8 when not given.


[3] string urlencode(string str)
[4] string urldecode(string str)
urlencode() replaces non-alphanumerics (except for hyphen, underscore
& period) with equivalent 2-digit hex escape sequences of the form
%xx. Space is replaced with plus(+).

The same argument as above applies to these.


[5] string rawurlencode(string str)
[6] string rawurldecode(string str)
rawurlencode() replaces non-alphanumerics (except for hyphen,
underscore & period) with equivalent 2-digit hex escape sequences of
the form %xx.

A couple of problems in converting [3]-[6] above to handle Unicode:
(1) 2-digit hex sequences don't cover the range of Unicode code points.
(2) The existing code has #define sections to handle EBCDIC and ASCII
input.

Ditto.
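
To make the point concrete, here is a rough sketch in Python rather than PHP
(the language is just what I had at hand, and the function names and the
"encoding" parameter are hypothetical, not an existing or proposed PHP API).
If the string is first converted to bytes in the chosen encoding, each byte
fits in a 2-digit %xx escape, so problem (1) goes away:

```python
from urllib.parse import quote, unquote


def encode_component(text, encoding="utf-8"):
    """Percent-encode text after converting it to bytes in `encoding`.

    `encoding` is the hypothetical extra parameter; it falls back to UTF-8.
    """
    raw = text.encode(encoding)
    # Every byte is in 0..255, so a 2-digit hex escape is always enough.
    # safe="" means reserved characters such as "/" are escaped as well.
    return quote(raw, safe="")


def decode_component(escaped, encoding="utf-8"):
    """Reverse of encode_component: unescape, then decode the bytes."""
    return unquote(escaped, encoding=encoding)


if __name__ == "__main__":
    print(encode_component("\u00e9"))                # UTF-8 bytes
    print(encode_component("\u00e9", "iso-8859-1"))  # a single byte
```

The same character gives different escapes depending on the encoding, which is
why the caller needs a way to say which one is meant.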



[7] string base64_encode(string str)
[8] string base64_decode(string str)
Implement base64 MIME

Is it correct to extend [7]&[8] above to support Unicode simply by
changing the iteration over the input string data ? Or should an
alternate transfer encoding method (quoted MIME ?) be used ?

The string should definitely be handled as a binary sequence. If the string
is a Unicode string, base64_encode() then has to convert it into some byte
encoding (possibly script_encoding?) and encode the result in base64, and
base64_decode() likewise has to decode the input and convert the bytes from
that encoding back into Unicode.
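
A minimal sketch of that two-step approach, again in Python only for
illustration (the function names and the "encoding" parameter are made up
here, and defaulting to UTF-8 is my assumption, standing in for whatever
script_encoding would be):

```python
import base64


def b64_encode_text(text, encoding="utf-8"):
    """Convert the Unicode string to bytes first, then base64 the bytes."""
    raw = text.encode(encoding)
    return base64.b64encode(raw).decode("ascii")


def b64_decode_text(data, encoding="utf-8"):
    """Base64-decode to bytes, then convert the bytes back to Unicode."""
    raw = base64.b64decode(data)
    return raw.decode(encoding)


if __name__ == "__main__":
    encoded = b64_encode_text("\u00e9")
    print(encoded)
    print(b64_decode_text(encoded) == "\u00e9")
```

Base64 itself never sees code points, only bytes, so nothing in the base64
step needs to change for Unicode.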



I had also posted the following question last week on the internals list, but
didn't get any responses. Any comments as to the correct approach?

[3] string addcslashes(string text, string charlist)
[4] string stripcslashes(string text)
Escape chars < 32 or > 126 with octal sequences, and escape
characters from charlist with backslash.


Escaping chars/code points with values > 126 is a problem in Unicode
strings. A 3-digit octal escape sequence can represent at most 0777, so
only code points up to 0x1FF can be escaped. One solution is to escape only
values < 32 with the 3-digit octal sequence. Another is to use hex sequences
for escaping everything.

I think escaping the original Unicode string is somewhat pointless; it
has to be done on the byte sequence converted to the intended encoding
(script_encoding / output_encoding) instead. That way the issue mentioned
above basically wouldn't arise.
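
Roughly what I have in mind, sketched in Python for illustration only (the
function name and the "encoding" parameter are hypothetical, the charlist
handling of addcslashes() is left out, and escaping the backslash itself is
my own addition to keep the output unambiguous):

```python
def escape_bytes_octal(text, encoding="utf-8"):
    """Octal-escape bytes < 32 or > 126 after converting text to `encoding`."""
    raw = text.encode(encoding)
    out = []
    for b in raw:
        if b == 0x5C:
            # Assumption: double the backslash so escapes stay unambiguous.
            out.append("\\\\")
        elif b < 32 or b > 126:
            # A byte is at most 255 (octal 377), so 3 digits always suffice.
            out.append("\\%03o" % b)
        else:
            out.append(chr(b))
    return "".join(out)


if __name__ == "__main__":
    print(escape_bytes_octal("a\u00e9\n"))
```

Because the escaping runs over bytes, the 3-digit octal form covers every
possible value, whatever the code point was.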

Moriyoshi

--
小泉 守義 <[EMAIL PROTECTED]>

--
PHP Unicode & I18N Mailing List (http://www.php.net/)