[ 
http://issues.apache.org/jira/browse/XERCESC-1284?page=comments#action_12367348 
] 

Alberto Massari commented on XERCESC-1284:
------------------------------------------

Hi David,
if the stream doesn't include a BOM (as the RFC mandates), we work in the same 
way as before; the only difference is that we now allow non-compliant 
applications to work too.

> Set "UTF-16" encoding for UTF16-BE entity with BOM results in parse failure
> ---------------------------------------------------------------------------
>
>          Key: XERCESC-1284
>          URL: http://issues.apache.org/jira/browse/XERCESC-1284
>      Project: Xerces-C++
>         Type: Bug
>     Versions: 2.6.0
>  Environment: Fedora Core 1, x86 PC, gcc.  Also seen similar failures in a 
> Solaris 9 environment with the forte compiler.
>     Reporter: Daniel McLean
>  Attachments: MemParseEncoding.tar.gz, utf8BOMTest.tar.gz
>
> Setting the encoding as "UTF-16" using the InputSource.setEncoding() method 
> seems to create problems during parsing.
> If I have a UTF-16BE document with a BOM, this parses successfully when no 
> encoding is explicitly set or when the encoding is set to "UTF-16BE".  
> When set to "UTF-16", a fatal error occurs with:               
>    Fatal Error at (file test, line 1, char 1): Invalid document structure
> Some investigation: Having looked through the Xerces source and done some 
> testing, it appears that when "UTF-16BE" is set, the "UTF-16 (BE)" transcoder 
> is used when a match is detected against the known encoding string.  When 
> "UTF-16" is set, no known encoding is detected and the document is probed for 
> an encoding, resulting in the XMLUTF16Transcoder being used.  In the latter 
> case, when XMLScanner::scanProlog() is called, it ends up reading the BOM and 
> choking because it doesn't look like a piece of prologue.  I'm guessing that 
> either the transcoder should have removed the BOM, the BOM should be detected 
> and ignored, or the BOM should have been trimmed off beforehand.
> I've attached a test case which is derived from the MemParse sample, which 
> parses four different UTF-16 documents (BE with BOM, BE without BOM, LE with 
> BOM, LE without BOM; I realise UTF-16 XML entities should have a BOM, but in 
> my case I want to know what happens if a client of my software feeds in a 
> UTF-16 document without a BOM) using four different encoding approaches (no 
> encoding set, "UTF-16", "UTF-16BE", "UTF-16LE").
> A summary of parsing success and failure on linux:
> FILE: UTF-16BE with BOM
> ENCODING: : Succeeded.
> ENCODING: UTF-16: Fatal error.
> ENCODING: UTF-16BE: Succeeded.
> ENCODING: UTF-16LE: Fatal error.
> --------------------------------
> FILE: UTF-16BE without BOM
> ENCODING: : Fatal error. (due to guess of UTF-8)
> ENCODING: UTF-16: Succeeded.
> ENCODING: UTF-16BE: Succeeded.
> ENCODING: UTF-16LE: Fatal error.
> --------------------------------
> FILE: UTF-16LE with BOM
> ENCODING: : Succeeded.
> ENCODING: UTF-16: Fatal error.
> ENCODING: UTF-16BE: Fatal error.
> ENCODING: UTF-16LE: Succeeded.
> --------------------------------
> FILE: UTF-16LE without BOM
> ENCODING: : Fatal error. (due to guess of UTF-8)
> ENCODING: UTF-16: Succeeded.
> ENCODING: UTF-16BE: Fatal error.
> ENCODING: UTF-16LE: Succeeded.
> --------------------------------
> Maybe there is a good reason for Xerces' current behaviour, but it
> escapes me.  I note that the lack of BOM helps parser success
> when setting an encoding of "UTF-16", supporting my assertion above.
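
For illustration, the "BOM should be detected and ignored" option from the 
report could look roughly like the sketch below. This is not the Xerces-C++ 
code and not the actual fix; the names Utf16Bom, detectUtf16Bom and 
utf16BomLength are hypothetical. It only assumes that the raw bytes of the 
entity are available before any transcoder sees them.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper types and names; not part of the Xerces-C++ API.
enum class Utf16Bom { None, BigEndian, LittleEndian };

// Look at the first two bytes of a raw buffer for a UTF-16 byte order mark.
// FE FF marks big-endian data, FF FE marks little-endian data.
// Caveat: FF FE 00 00 is also the UTF-32LE BOM, which this sketch ignores.
inline Utf16Bom detectUtf16Bom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);
    if (size < 2)
        return Utf16Bom::None;
    if (data[0] == 0xFE && data[1] == 0xFF)
        return Utf16Bom::BigEndian;
    if (data[0] == 0xFF && data[1] == 0xFE)
        return Utf16Bom::LittleEndian;
    return Utf16Bom::None;
}

// Number of leading bytes to skip so that the prolog scan starts at "<?xml"
// (or at the root element) rather than at the BOM.
inline std::size_t utf16BomLength(const unsigned char* data, std::size_t size)
{
    return detectUtf16Bom(data, size) == Utf16Bom::None ? 0 : 2;
}
```

A caller that has been told "UTF-16" could use the detected byte order to 
pick the BE or LE transcoder and skip utf16BomLength() bytes before scanning 
the prolog, falling back to its existing behaviour when no BOM is found.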
