[ http://issues.apache.org/jira/browse/XERCESC-1284?page=comments#action_12367348 ]
Alberto Massari commented on XERCESC-1284:
------------------------------------------

Hi David,
if the stream doesn't include a BOM (as the RFC mandates), we work in the same way as we did before; the only difference is that now we allow non-compliant applications to work too.

> Set "UTF-16" encoding for UTF16-BE entity with BOM results in parse failure
> ---------------------------------------------------------------------------
>
>          Key: XERCESC-1284
>          URL: http://issues.apache.org/jira/browse/XERCESC-1284
>      Project: Xerces-C++
>         Type: Bug
>     Versions: 2.6.0
>  Environment: Fedora Core 1, x86 PC, gcc. Also seen similar failures in a
>               Solaris 9 environment with the forte compiler.
>     Reporter: Daniel McLean
>  Attachments: MemParseEncoding.tar.gz, utf8BOMTest.tar.gz
>
> Setting the encoding as "UTF-16" using the InputSource.setEncoding() method
> seems to create problems during parsing.
> If I have a UTF-16BE document with a BOM, this parses successfully when no
> encoding is explicitly set or when the encoding is set to "UTF-16BE".
> When set to "UTF-16", a fatal error occurs with:
>   Fatal Error at (file test, line 1, char 1): Invalid document structure
>
> Some investigation: Having looked through the Xerces source and done some
> testing, it appears that when "UTF-16BE" is set, the "UTF-16 (BE)" transcoder
> is used when a match is detected against the known encoding string. When
> "UTF-16" is set, no known encoding is detected and the document is probed for
> an encoding, resulting in the XMLUTF16Transcoder being used. In the latter
> case, when XMLScanner::scanProlog() is called, it ends up reading the BOM and
> choking because it doesn't look like a piece of prologue. I'm guessing that
> either the transcoder should have removed the BOM, the BOM should be detected
> and ignored, or the BOM should have been trimmed off beforehand.
>
> I've attached a test case which is derived from the MemParse sample, which
> parses four different UTF-16 documents (BE with BOM, BE without BOM, LE with
> BOM, LE without BOM) using four different encoding approaches (no encoding
> set, "UTF-16", "UTF-16BE", "UTF-16LE"). I realise UTF-16 XML entities should
> have a BOM, but in my case I want to know what happens if a client of my
> software feeds in a UTF-16 document without a BOM.
>
> A summary of parsing success and failure on Linux:
>
> FILE: UTF-16BE with BOM
>   ENCODING: : Succeeded.
>   ENCODING: UTF-16: Fatal error.
>   ENCODING: UTF-16BE: Succeeded.
>   ENCODING: UTF-16LE: Fatal error.
> --------------------------------
> FILE: UTF-16BE without BOM
>   ENCODING: : Fatal error. (due to guess of UTF-8)
>   ENCODING: UTF-16: Succeeded.
>   ENCODING: UTF-16BE: Succeeded.
>   ENCODING: UTF-16LE: Fatal error.
> --------------------------------
> FILE: UTF-16LE with BOM
>   ENCODING: : Succeeded.
>   ENCODING: UTF-16: Fatal error.
>   ENCODING: UTF-16BE: Fatal error.
>   ENCODING: UTF-16LE: Succeeded.
> --------------------------------
> FILE: UTF-16LE without BOM
>   ENCODING: : Fatal error. (due to guess of UTF-8)
>   ENCODING: UTF-16: Succeeded.
>   ENCODING: UTF-16BE: Fatal error.
>   ENCODING: UTF-16LE: Succeeded.
> --------------------------------
>
> Maybe there is a good reason for Xerces' current behaviour, but it
> escapes me. I note that the lack of BOM helps parser success
> when setting an encoding of "UTF-16", supporting my assertion above.
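
The report suggests that the BOM "should be detected and ignored" when the caller sets the generic "UTF-16" label. As a rough illustration of that idea only, here is a minimal standalone C++ sketch. It is not the Xerces-C++ implementation and not the fix that was committed: the names `Utf16Bom`, `BomProbe`, `probeUtf16Bom` and `resolveEncoding` are hypothetical, and the assumption that "UTF-16" should simply defer to whatever BOM is present is mine.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper types; none of these names exist in Xerces-C++.
enum class Utf16Bom { None, BigEndian, LittleEndian };

struct BomProbe {
    Utf16Bom kind;     // which UTF-16 BOM, if any, starts the buffer
    std::size_t skip;  // number of leading bytes to drop before scanning
};

// Look at the first two bytes of the buffer for a UTF-16 byte order mark.
inline BomProbe probeUtf16Bom(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return {Utf16Bom::BigEndian, 2};
        if (data[0] == 0xFF && data[1] == 0xFE) return {Utf16Bom::LittleEndian, 2};
    }
    return {Utf16Bom::None, 0};
}

// Assumption: a generic "UTF-16" request defers to the BOM when one is
// present. Any other label, or "UTF-16" without a BOM, is returned unchanged
// so that the parser's own handling applies.
inline std::string resolveEncoding(const std::string& requested, Utf16Bom bom) {
    if (requested == "UTF-16") {
        if (bom == Utf16Bom::BigEndian) return "UTF-16BE";
        if (bom == Utf16Bom::LittleEndian) return "UTF-16LE";
    }
    return requested;
}
```

A caller would probe the buffer, pass `data + probe.skip` to the parser, and set the resolved label on the input source, so that the prolog scanner never sees the BOM bytes.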
