[ 
http://issues.apache.org/jira/browse/XERCESC-1284?page=comments#action_12367060 
] 

David Bertoni commented on XERCESC-1284:
----------------------------------------

Are we sure our behavior here agrees with RFC 2781?

http://rfc.net/rfc2781.html

Note specifically the text in section 3.3:

  "Any labelling application that uses UTF-16 character encoding, and
   explicitly labels the text, and knows the serialization order of the
   characters in text, SHOULD label the text as either "UTF-16BE" or
   "UTF-16LE", whichever is appropriate based on the endianness of the
   text. This allows applications processing the text, but unable to
   look inside the text, to know the serialization definitively.

   Text in the "UTF-16BE" charset MUST be serialized with the octets
   which make up a single 16-bit UTF-16 value in big-endian order.
   Systems labelling UTF-16BE text MUST NOT prepend a BOM to the text.

   Text in the "UTF-16LE" charset MUST be serialized with the octets
   which make up a single 16-bit UTF-16 value in little-endian order.
   Systems labelling UTF-16LE text MUST NOT prepend a BOM to the text.

   Any labelling application that uses UTF-16 character encoding, and
   puts an explicit charset label on the text, and does not know the
   serialization order of the characters in text, MUST label the text as
   "UTF-16", and SHOULD make sure the text starts with 0xFEFF.

   An exception to the "SHOULD" rule of using "UTF-16BE" or "UTF-16LE"
   would occur with document formats that mandate a BOM in UTF-16 text,
   thereby requiring the use of the "UTF-16" tag only."

Since users are likely to apply MIME encodings to documents parsed by Xerces-C 
using InputSource::setEncoding(), we should make sure our implementation is 
consistent with this RFC.  Note that this seems to me to mandate that streams 
labelled UTF-16LE or UTF-16BE must not contain a BOM.
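
To make that constraint concrete, here is a minimal sketch of a consistency
check between a charset label and the first two octets of a stream. The names
detectUtf16Bom and labelAgreesWithRfc2781 are hypothetical helpers, not part
of the Xerces-C API, and the label comparison is exact-match for brevity (real
charset labels are case-insensitive). It also assumes a consumer-side reading
of section 3.3, which is written in terms of what the labelling system does.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper type: which UTF-16 BOM, if any, starts the buffer.
enum class Utf16Bom { None, BigEndian, LittleEndian };

// Inspect the first two octets for 0xFE 0xFF (BE) or 0xFF 0xFE (LE).
Utf16Bom detectUtf16Bom(const unsigned char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return Utf16Bom::BigEndian;
        if (data[0] == 0xFF && data[1] == 0xFE) return Utf16Bom::LittleEndian;
    }
    return Utf16Bom::None;
}

// Returns true when the label and the leading octets are consistent with the
// quoted text of RFC 2781 section 3.3:
//   "UTF-16BE" / "UTF-16LE": no BOM may be prepended.
//   "UTF-16": a BOM is a SHOULD, so both cases are accepted here.
// Assumption: any other label is outside the scope of this check.
bool labelAgreesWithRfc2781(const std::string& label,
                            const unsigned char* data, std::size_t size) {
    const Utf16Bom bom = detectUtf16Bom(data, size);
    if (label == "UTF-16BE" || label == "UTF-16LE") {
        return bom == Utf16Bom::None;
    }
    return true;
}
```

Under this reading, a big-endian document that starts with 0xFE 0xFF passes
with the label "UTF-16" and fails with "UTF-16BE".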

> Set "UTF-16" encoding for UTF16-BE entity with BOM results in parse failure
> ---------------------------------------------------------------------------
>
>          Key: XERCESC-1284
>          URL: http://issues.apache.org/jira/browse/XERCESC-1284
>      Project: Xerces-C++
>         Type: Bug
>     Versions: 2.6.0
>  Environment: Fedora Core 1, x86 PC, gcc.  Also seen similar failures in a 
> Solaris 9 environment with the forte compiler.
>     Reporter: Daniel McLean
>  Attachments: MemParseEncoding.tar.gz, utf8BOMTest.tar.gz
>
> Setting the encoding as "UTF-16" using the InputSource.setEncoding() method 
> seems to create problems during parsing.
> If I have a UTF-16BE document with a BOM, this parses successfully when no 
> encoding is explicitly set or when the encoding is set to "UTF-16BE".  
> When set to "UTF-16", a fatal error occurs with:               
>    Fatal Error at (file test, line 1, char 1): Invalid document structure
> Some investigation: Having looked through the Xerces source and done some 
> testing, it appears that when "UTF-16BE" is set, the "UTF-16 (BE)" transcoder 
> is used when a match is detected against the known encoding string.  When 
> "UTF-16" is set, no known encoding is detected and the document is probed for 
> an encoding, resulting in the XMLUTF16Transcoder being used.  In the latter 
> case, when XMLScanner::scanProlog() is called, it ends up reading the BOM and 
> choking because it doesn't look like a piece of prologue.  I'm guessing that 
> either the transcoder should have removed the BOM, the BOM should be detected 
> and ignored, or the BOM should have been trimmed off beforehand.
> I've attached a test case, derived from the MemParse sample, which parses 
> four different UTF-16 documents (BE with BOM, BE without BOM, LE with BOM, 
> LE without BOM) using four different encoding approaches (no encoding set, 
> "UTF-16", "UTF-16BE", "UTF-16LE").  I realise UTF-16 XML entities should 
> have a BOM, but in my case I want to know what happens if a client of my 
> software feeds in a UTF-16 document without a BOM.
> A summary of parsing success and failure on linux:
> FILE: UTF-16BE with BOM
> ENCODING: : Succeeded.
> ENCODING: UTF-16: Fatal error.
> ENCODING: UTF-16BE: Succeeded.
> ENCODING: UTF-16LE: Fatal error.
> --------------------------------
> FILE: UTF-16BE without BOM
> ENCODING: : Fatal error. (due to guess of UTF-8)
> ENCODING: UTF-16: Succeeded.
> ENCODING: UTF-16BE: Succeeded.
> ENCODING: UTF-16LE: Fatal error.
> --------------------------------
> FILE: UTF-16LE with BOM
> ENCODING: : Succeeded.
> ENCODING: UTF-16: Fatal error.
> ENCODING: UTF-16BE: Fatal error.
> ENCODING: UTF-16LE: Succeeded.
> --------------------------------
> FILE: UTF-16LE without BOM
> ENCODING: : Fatal error. (due to guess of UTF-8)
> ENCODING: UTF-16: Succeeded.
> ENCODING: UTF-16BE: Fatal error.
> ENCODING: UTF-16LE: Succeeded.
> --------------------------------
> Maybe there is a good reason for Xerces' current behaviour, but it
> escapes me.  I note that the absence of a BOM helps the parser succeed
> when the encoding is set to "UTF-16", supporting my assertion above.
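
The "detect and ignore the BOM" option raised in the report could look roughly
like the sketch below for input labelled "UTF-16". This is not how Xerces-C
does it; Utf16Plan and planUtf16Decode are hypothetical names. The big-endian
default when no BOM is present follows RFC 2781 section 4.3 and is exposed as
a parameter, since a parser may prefer to probe the content instead.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type: byte order to decode with, and how many leading
// octets (the BOM, if present) to skip before handing data to the scanner.
struct Utf16Plan {
    bool bigEndian;
    std::size_t skip;
};

// Decide byte order for a stream labelled "UTF-16".
// A leading 0xFE 0xFF selects big-endian, 0xFF 0xFE selects little-endian,
// and in both cases the two BOM octets are skipped. Without a BOM, nothing is
// skipped and the caller-supplied default is used.
Utf16Plan planUtf16Decode(const unsigned char* data, std::size_t size,
                          bool defaultBigEndian = true) {
    assert(data != nullptr || size == 0);
    if (size >= 2) {
        if (data[0] == 0xFE && data[1] == 0xFF) return {true, 2};
        if (data[0] == 0xFF && data[1] == 0xFE) return {false, 2};
    }
    return {defaultBigEndian, 0};
}
```

With this approach the scanner would start reading at data + skip, so the
prolog check never sees the BOM octets.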
