#22108 [WFx->Asn]: php doesn't ignore the utf-8 BOM

moriyoshi Wed, 04 Jun 2003 17:47:27 -0700

 ID:               22108
 Updated by:       [EMAIL PROTECTED]
 Reported By:      bugzilla at jellycan dot com
-Status:           Wont fix
+Status:           Assigned
 Bug Type:         Feature/Change Request
 Operating System: Any
 PHP Version:      All (as of the current implementation)
 Assigned To:      moriyoshi
 New Comment:


Derick,

Please do not change the status of the bug that is already assigned to
someone.

It makes no sense to say that PHP can handle only ASCII documents: if
you want to write German in PHP, for example, you have to use at least
ISO-8859-1 or ISO-8859-15, neither of which is ASCII.



Previous Comments:
------------------------------------------------------------------------

[2003-06-03 14:17:22] [EMAIL PROTECTED]

Feel free to rewrite the parser, but that's just not going to happen.
We want ascii import, not unicode.

------------------------------------------------------------------------

[2003-06-03 14:07:16] gump at hotmail dot com

> [8 Feb 4:24am CST] [EMAIL PROTECTED]

> PHP doesn't want UNICODE scripts, but just ASCII ones. Not 
> a bug -> bogus.

Not bogus.  

PHP is embedded in HTML; the surrounding document determines the
encoding.  You can't just specify this problem out of existence.

------------------------------------------------------------------------

[2003-05-05 03:40:23] tokiee at sayclub dot com

For those who are not familiar with UTF-8:

UTF-8 (UCS Transformation Format 8) is not at odds with ASCII; it is
compatible with it. If you write English text in UTF-8, every byte is
identical to the same text in ASCII. (And the UTF-8 BOM is optional.)

This is not an exact explanation of UTF-8, but: UTF-8 extends ASCII to
support the full Unicode character set without disturbing the order of
the existing ASCII characters. So for ASCII text, UTF-8 is ASCII, and
you don't have to think of it as something totally new and strange.

Actually, when a modern Unicode-aware OS reads UTF-8, it typically
CONVERTS it to its own internal Unicode representation. In that sense
UTF-8 is rather similar to URL encoding.

In the ASCII world, each byte corresponds to one character, up to 128
characters (256 in 8-bit extensions such as ISO-8859-1).

In two-byte Unicode (UCS-2), two bytes correspond to one character, up
to 65536 characters, and that really is a totally new system, as you think.

UTF-8 is interesting: a character can be one byte, two bytes, or even
three or four bytes. That may sound complicated, but the rule is simple,
and you don't actually have to handle it yourself: the OS will do it
for you.

Even if you have software that does not understand UTF-8, that is
totally okay, because UTF-8 is ASCII transparent. So it "can be used with
normal string comparison functions for sorting and such." (quoted from
the PHP.NET reference for utf8_encode())
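
The byte-level claims above can be checked with a short sketch. It is
written in Python purely for illustration (it is not PHP and not part of
the original report), and the sample strings are arbitrary choices.

```python
import codecs

# Pure-ASCII text produces exactly the same bytes in ASCII and in UTF-8.
ascii_text = "echo 'hello';"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Non-ASCII characters take 2, 3 or 4 bytes in UTF-8.
# Samples: "a", u-umlaut (U+00FC), euro sign (U+20AC), G clef (U+1D11E).
samples = ["a", "\u00fc", "\u20ac", "\U0001d11e"]
lengths = [len(ch.encode("utf-8")) for ch in samples]

# The optional UTF-8 BOM is the three bytes EF BB BF.
bom = codecs.BOM_UTF8

# Byte-wise comparison of UTF-8 data preserves code point order, which is
# why plain byte-oriented string comparison still sorts consistently.
words = ["z", "\u00e9", "a", "\u20ac"]
sorted_by_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
sorted_by_code_point = sorted(words)
```

Here `lengths` holds the encoded size of each sample character, and the
two sorted lists can be compared to confirm that they agree.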

------------------------------------------------------------------------

[2003-04-14 12:17:37] [EMAIL PROTECTED]

As a short-term workaround (yes I know it's not a solution), can you
try using output buffering?  That should at least solve the problem of
sneaking the headers in prior to the BOM even if it doesn't solve the
underlying problem of recognizing document encodings properly.

------------------------------------------------------------------------

[2003-04-06 00:53:04] tronxoe at hotpop dot com

The BOM is still fine as long as the PHP file does not include another
Unicode file (by using @include()).

Another problem: if a PHP file is saved as Unicode, sessions and
cookies cannot be used because of "headers already sent ...". I think the
first 3 bytes have already been sent in this case.
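
A minimal sketch of how one might check whether a script file starts with
those three bytes. It is in Python for illustration only; the helper names
`starts_with_utf8_bom` and `strip_utf8_bom` and the sample script contents
are hypothetical, not something from PHP or from this report.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string begins with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM, if one is present."""
    if starts_with_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


# A script saved as "UTF-8 with BOM": three bytes sit in front of the
# opening tag, which matches the reporter's guess about what gets sent
# before the headers.
with_bom = codecs.BOM_UTF8 + b"<?php session_start(); ?>"
without_bom = b"<?php session_start(); ?>"
```

Running `starts_with_utf8_bom` over the raw bytes of each included file
would show which ones carry a BOM; `strip_utf8_bom` leaves files without
one unchanged.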

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/22108

-- 
Edit this bug report at http://bugs.php.net/?id=22108&edit=1
