ID:               22108
 Comment by:       gergely77 at hotmail dot com
 Reported By:      bugzilla at jellycan dot com
 Status:           Assigned
 Bug Type:         Feature/Change Request
 Operating System: *
 PHP Version:      *
 Assigned To:      moriyoshi
 New Comment:

This bug can be worked around by using a hex editor to delete the first
3 bytes of the file (EF BB BF), thereby removing the UTF-8 BOM (Byte
Order Mark) that Windows editors insert. Frhed, a free GPL-licensed hex
editor, is available as source and as a Windows executable at
http://www.kibria.de/frhed.html
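
For anyone who would rather script the workaround than use a hex editor,
here is a minimal sketch in Python. The function names are hypothetical
and not part of PHP or of this report; it assumes the file is small
enough to read into memory and only removes the BOM if one is present.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_utf8_bom_from_file(path: str) -> bool:
    """Rewrite the file at path without its BOM.

    Returns True if a BOM was found and removed, False if the file
    was left untouched.
    """
    with open(path, "rb") as f:
        data = f.read()
    stripped = strip_utf8_bom(data)
    if stripped == data:
        return False  # no BOM, nothing to do
    with open(path, "wb") as f:
        f.write(stripped)
    return True
```

Calling strip_utf8_bom_from_file() on each affected script should have
the same effect as deleting the three bytes by hand.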


Previous Comments:
------------------------------------------------------------------------

[2003-11-09 16:12:50] a9c83cd8bb41db324db5b449352f183 at arcor dot de

I've thought about it... Now I think it's better if the BOM isn't part
of the output, because that would cause trouble if you want to output
images, PDFs or something like that...

------------------------------------------------------------------------

[2003-11-08 06:45:22] a9c83cd8bb41db324db5b449352f183 at arcor dot de

I think the best solution would be for PHP to recognize the BOM and
output it before the document (but after the HTTP headers, of course),
so that the document can still be recognized as UTF-8 when it's saved
to disk (where no Content-Type headers with a charset specification are
available).

------------------------------------------------------------------------

[2003-11-07 03:09:53] trunghongnguyen at yahoo dot com

ertre

------------------------------------------------------------------------

[2003-10-31 11:12:06] [EMAIL PROTECTED]

I added i18n support to Zend Engine 2 (though it is still only
partial...), and one of its features is awareness of the BOM. So you
can now parse scripts with a BOM cleanly if you use PHP 5.0.0b2 and
configure it with the option '--enable-zend-multibyte'.

These features are still experimental and under testing, so I have not
documented them yet. I'll add entries to the manual, ZEND_CHANGES and
so on once I feel certain of the stability and robustness of my patch,
though I do not know when that will be :)

Anyway, I'll close this bug once the '--enable-zend-multibyte' option
in PHP 5.0.0b2 is confirmed to work well for this problem. Comments are
welcome.

------------------------------------------------------------------------

[2003-02-07 23:13:07] bugzilla at jellycan dot com

The BOM (byte order mark) is a few bytes at the very front of a file
that act as a signature denoting which encoding has been used, and in
UTF-16/32 it also marks the byte order (LE or BE). Although UTF-8 is
byte order independent, it has become popular on Windows (perhaps not
so on Unix) to use the BOM encoded in UTF-8 to flag the file as being
in UTF-8 format. This allows editors to determine the type of the file
from the first few bytes instead of trying to guess it.
Ref: TextPad 4.6 (http://textpad.com)

See the Unicode FAQ for details of the utf-8 BOM...
http://www.unicode.org/unicode/faq/utf_bom.html#25
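
To illustrate how an editor might use the BOM as a signature, here is a
rough sketch in Python. The function name detect_bom and the returned
labels are assumptions made for this example, not taken from any
particular editor. Note that the UTF-32 signatures have to be checked
before the UTF-16 ones, because the UTF-32 LE BOM (FF FE 00 00) begins
with the UTF-16 LE BOM (FF FE).

```python
import codecs

# Longer signatures first, so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return an encoding label based on a leading BOM, or None.

    None means no BOM was found; the caller would then have to
    guess the encoding some other way.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A file without a BOM returns None here, which is exactly the case where
an editor has to fall back on guessing.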

The use of this should be obvious. You have to leave the
my-language-only mindset that afflicts too many programmers (myself
included before this job) and think about the growing multiplicity of
languages on the web. I am writing web applications in Japan, with
European language and CJK (Chinese/Japanese/Korean) language processing
and interfaces. Thus I have PHP files where variable values are strings
in all sorts of languages, hence the UTF-8 encoding.

I feel that this is definitely a bug in PHP, considering that:
* PHP is slowly growing into a language-neutral (i18n/l10n capable)
language
* PHP is designed such that PHP commands can be liberally sprinkled
through HTML, and HTML is increasingly encoded in UTF-8 these days
* the UTF-8 BOM is becoming increasingly popular as a way of
identifying the file's character encoding
* if the UTF-8 BOM exists, PHP actually outputs it, which is incorrect,
and in doing so prevents header output

I request that you don't see this as a feature request, but as a bug in
the handling of utf-8 files. Whether the output generator is the
correct characterization of this bug or not I leave up to you.

Regards,
Brodie.

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/22108

-- 
Edit this bug report at http://bugs.php.net/?id=22108&edit=1