RE: UTF-8 BOM generation option

Thayer_Nicholas Mon, 31 Dec 2007 10:24:50 -0800

Hello Keith,

Yes, interoperability is a primary concern.  That is why Xerces _must_
be able to generate UTF-8 without a BOM for those parsers that do not
handle one, and I believe omitting the BOM from default UTF-8 output is
the appropriate solution, given that the spec does not require the BOM
to be present for byte streams.

Therefore, the safest route is to not generate a BOM by default, because
all XML parsers should handle non-BOM UTF-8 XML files.  If the user's
environment is closed enough to verify that the available parsers will
handle UTF-8 files with a BOM, then the option to add the BOM to the
generated files should be available.
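
The opt-in behaviour described above could look roughly like the sketch
below.  It uses only the C++ standard library; serializeUtf8 and kUtf8Bom
are hypothetical names chosen for illustration and are not part of the
Xerces API.

```cpp
#include <string>

// The 3-byte UTF-8 byte order mark: 0xEF 0xBB 0xBF.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Hypothetical helper (not a Xerces function): returns the bytes to write
// for an already UTF-8 encoded document. The BOM is prepended only when
// the caller explicitly asks for it, so the default output is unchanged.
std::string serializeUtf8(const std::string& xml, bool writeBom = false) {
    return writeBom ? kUtf8Bom + xml : xml;
}
```

Callers that never pass the flag get exactly the bytes they get today;
only a caller passing true sees the three extra leading bytes.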

My suggestion has zero impact on most users, but still allows the option
of BOM generation for those who request it.

Regards,
Nicholas

-----Original Message-----
From: Keith Mendoza [mailto:[EMAIL PROTECTED] 
Sent: Monday, December 31, 2007 11:14 AM
To: [email protected]
Subject: Re: UTF-8 BOM generation option

Here's my 0.02 on this issue: I think we should look at the safest route
to take with this. As Nicholas stated, files containing this BOM are
becoming more prevalent. So if that's the case, I personally think that
Xerces (both Java and C versions) should just generate the BOM.

However, I also understand that this change could cause a potential
problem. One situation I see is applications using XML for some kind of
inter-process communication, not necessarily XML-RPC or SOAP. Suppose we
have one application using Xerces to parse the XML data it receives, and
another one NOT using Xerces and NOT supporting the 3-byte BOM. If the
Xerces-dependent application transmits the 3-byte BOM, will the other
application handle the data properly or not?
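
For illustration, a receiver that does not use Xerces could tolerate the
BOM with a check along these lines.  This is only a sketch using the C++
standard library; hasUtf8Bom and stripUtf8Bom are hypothetical names, and
it assumes the whole message is already held in a std::string.

```cpp
#include <string>

// Hypothetical helper: true if the buffer starts with the 3-byte
// UTF-8 BOM (0xEF 0xBB 0xBF).
bool hasUtf8Bom(const std::string& data) {
    return data.size() >= 3 &&
           static_cast<unsigned char>(data[0]) == 0xEF &&
           static_cast<unsigned char>(data[1]) == 0xBB &&
           static_cast<unsigned char>(data[2]) == 0xBF;
}

// Hypothetical helper: returns the payload without a leading BOM, so the
// rest of the application sees the same bytes either way.
std::string stripUtf8Bom(const std::string& data) {
    return hasUtf8Bom(data) ? data.substr(3) : data;
}
```

An application without such a check may treat the three bytes as content
before the XML declaration, which is the failure mode in question.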

Hope this helps stir up the conversation,
Keith

On Dec 31, 2007 8:28 AM, <[EMAIL PROTECTED]> wrote:

> Hello all,
>
> I sent this same email to the c-dev list.  Its content applies from
> both a user and a dev (mods) perspective, so I'm posting to this list
> as well.
>
> -----------------
>
> I realize that the UTF-8 spec does not require the 3-byte BOM
> (0xEF 0xBB 0xBF) to be added to a UTF-8 encoded file, but some editors
> (MS, vim, etc.) use this BOM when reading the XML file to determine
> encoding.  The reality of the situation is that a number of UTF-8
> files do contain a BOM, and this trend seems to be becoming more
> prevalent with time (at least in the XML datasets that I have been
> exposed to over the years).
>
> Luckily, Xerces already handles BOM markers for UTF-8 files, so there
> is no compatibility issue with it being able to read its own generated
> files.
>
> My suggestion is to allow Xerces to generate a BOM for a UTF-8 encoded
> file if it is explicitly asked to do so through the serializer
> (DOMWriter) by setting the XMLUni::fgDOMWRTDOM feature.  Most people
> won't set this feature, so they keep the current behaviour of
> generated UTF-8 files not containing the BOM, but with this change the
> addition of a BOM to generated UTF-8 files would become an option for
> those who do want it.
>
> Since the Xerces code is well written, the code modifications would be
> quite small to accommodate this change.
>
> I can make the changes and submit them as a patch request, but first I
> would like to start a discussion about this topic to help determine
> what the best implementation should be.  I'd ask that a pragmatic and
> realistic viewpoint, rather than a hard-line spec viewpoint, be adopted,
> since BOMs in UTF-8 encoded files are out there and will not be going
> away.
>
> Thank you,
> _Nicholas
>
>

-- 
www.savedbycuriosity.com
