Edit report at http://bugs.php.net/bug.php?id=36785&edit=1

 ID:                 36785
 Comment by:         tom at tomclegg dot net
 Reported by:        giunta_gaetano at libero dot it
 Summary:            xml_parse return invalid character error with
                     ISO-8859-1 data
 Status:             Bogus
 Type:               Bug
 Package:            XML related
 Operating System:   windows 2000
 PHP Version:        5.1.2
 Block user comment: N
 Private report:     N

 New Comment:

Bug #33375 hints at it, but giunta_gaetano has explained it better
here.

The problem is that there is no (sensible) way to tell the XML parser
which character encoding to use in cases where the XML declaration does
not specify an encoding.

Bug #33375 vaguely hints that XML encodings other than UTF-8 must be
specified in the XML declaration.  On the contrary,
http://www.w3.org/TR/REC-xml/#charencoding specifically allows for
the encoding to be specified by alternate means (for example, MIME
headers).  Why shouldn't PHP have the ability to work in such an
environment?

Meanwhile, a workaround is to use preg_replace() to add an encoding
attribute to the XML declaration before passing the XML data to
xml_parse().  Ugly, but more effective than saying "it shouldn't
work".
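
For what it's worth, the preg_replace() workaround could look roughly
like the sketch below.  The function names declare_xml_encoding() and
parse_with_encoding() are made up for this example (they are not part of
PHP), and the sketch assumes the XML declaration, if there is one, sits
at the very start of the data with no BOM or whitespace before it.

```php
<?php
// Hypothetical helper (not part of PHP): make sure the XML declaration
// names an encoding before the data is handed to xml_parse().
function declare_xml_encoding($xml, $encoding)
{
    // A declaration that already names an encoding is left alone.
    if (preg_match('/^<\?xml\s[^>]*\bencoding\s*=/', $xml)) {
        return $xml;
    }

    // Declaration present but without an encoding: insert one directly
    // after version="...", which is where the XML grammar expects it.
    $count = 0;
    $xml = preg_replace(
        '/^(<\?xml\s+version\s*=\s*(["\'])[^"\']*\2)/',
        '$1 encoding="' . $encoding . '"',
        $xml,
        1,
        $count
    );

    // No declaration at all: prepend a complete one.
    if ($count === 0) {
        $xml = '<?xml version="1.0" encoding="' . $encoding . '"?>'
             . "\n" . $xml;
    }
    return $xml;
}

// Hypothetical wrapper: $encoding is whatever the outer layer (for
// example a MIME or HTTP header) said the data is encoded in.
function parse_with_encoding($xml, $encoding)
{
    $parser = xml_parser_create();
    return xml_parse($parser, declare_xml_encoding($xml, $encoding), true) == 1;
}
```

With that in place, parse_with_encoding($data, 'ISO-8859-1') hands the
parser the same bytes as before, just with the encoding spelled out in
the declaration.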


Previous Comments:
------------------------------------------------------------------------
[2006-03-19 21:12:24] [email protected]

The cause and the solution is properly explained in bug #33375.

NO bug here.

------------------------------------------------------------------------
[2006-03-19 00:21:27] giunta_gaetano at libero dot it

Description:
------------
PLEASE REOPEN AND FIX BUG #33375!

It bewilders me that this has not yet been fixed in php 5.1.2...

It is a BC breakage against PHP 4, and makes very very little sense
anyway:

- xml does NOT mandate a charset specification in the prologue

- other communication/storage layers impose DIFFERENT standards on
charset declarations and default charset values than the xml spec does
by itself

to be more clear, a common example:

- received xml message has no charset in the prologue

- it is received over HTTP, and the http content-type header states a
charset (it is authoritative, according to the specs)

- there is no way to tell the xml parser to use the correct charset for
parsing the message!

Why on earth was it not decided, when switching to libxml, that
xml_parser_create() would get some automagic new powers, while
xml_parser_create('ISO-8859-1') would be 100% backwards compatible and
let the coder specify a source charset???

PS: at least fix the manual, and clearly specify that in order for the
'magical charset detection' to work, the xml prologue MUST contain a
charset declaration!!!

PPS: last but not least: the column number where the error is found
(xml_get_current_column_number()) is also borked: whereas with php 4 the
error reported the column corresponding to the first non-ascii char
found, with php 5 the error reports the column where the xml element
closing tag starts, which is a bit misleading...

Reproduce code:
---------------
Just try to parse any ISO-8859-1 xml file that has no charset specified
in the prologue.
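
A minimal script along those lines might look like this.  The sample
document and the function name try_parse() are invented for
illustration; the script assumes a libxml-based PHP 5 build as described
above, and the error code and column it prints will depend on the
build.

```php
<?php
// Hypothetical helper: parse $xml in one go and report any error.
function try_parse($xml)
{
    // On libxml-based builds the argument only sets the output
    // encoding; the input encoding is detected from the data itself.
    $parser = xml_parser_create('ISO-8859-1');
    if (!xml_parse($parser, $xml, true)) {
        $code = xml_get_error_code($parser);
        printf(
            "error %d (%s) at line %d, column %d\n",
            $code,
            xml_error_string($code),
            xml_get_current_line_number($parser),
            xml_get_current_column_number($parser)
        );
        return false;
    }
    return true;
}

// "\xE9" is e-acute in ISO-8859-1; as a lone byte it is not valid UTF-8.
// The prologue deliberately says nothing about the charset.
$latin1 = "<?xml version=\"1.0\"?>\n<name>Ren\xE9</name>";
$ascii  = "<?xml version=\"1.0\"?>\n<name>Rene</name>";
```

Calling try_parse($ascii) and try_parse($latin1) one after the other
shows the difference: the two documents are identical apart from the
single non-ascii byte.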

Expected result:
----------------
no error

Actual result:
--------------
a dumb parsing error


------------------------------------------------------------------------


