Hello Keith,

Yes, interoperability is a primary concern. That is why the Xerces parsers _must_ generate UTF-8 without a BOM by default, for the benefit of those parsers that do not handle one. I believe leaving the BOM out of default UTF-8 output is the appropriate solution, given that the spec does not require the BOM to be present for byte streams.

Therefore, the safest route is to not generate a BOM by default, because all XML parsers should handle UTF-8 XML files without a BOM. If the user's environment is closed enough to verify that the available parsers will handle UTF-8 files with a BOM, then the option to add the BOM to the generated files should be available. My suggestion has zero impact on most users, but still allows BOM generation for those who request it.

Regards,
Nicholas

-----Original Message-----
From: Keith Mendoza [mailto:[EMAIL PROTECTED]
Sent: Monday, December 31, 2007 11:14 AM
To: [email protected]
Subject: Re: UTF-8 BOM generation option

Here's my 0.02 on this issue: I think we should look at the safest route to take with this. As Nicholas stated, files containing this BOM are becoming more prevalent. If that's the case, I personally think that Xerces (both Java and C versions) should just generate the BOM.

However, I also understand that this change could cause a potential problem. One situation I see is applications using XML for some kind of inter-process communication, not necessarily XML-RPC or SOAP. Suppose one application uses Xerces to parse the XML data it receives, and another one does NOT use Xerces and does NOT support the 3-byte BOM. If the Xerces-dependent application transmits the 3-byte BOM, will the other application handle the data properly or not?

Hope this helps stir up the conversation,
Keith

On Dec 31, 2007 8:28 AM, <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I sent this same email to the c-dev list. Its content applies from both
> a user as well as a dev (mods) perspective, so I'm posting to this list
> as well.
>
> -----------------
>
> I realize that the UTF-8 spec does not require the 0xEFBBBF 3-byte BOM
> be added to a UTF-8 encoded file, but some editors (MS, vim, etc.) use
> this BOM when reading the XML file to determine encoding.
> The reality of the situation is that a number of UTF-8 files do contain
> a BOM, and this trend seems to be becoming more prevalent with time (at
> least in the XML datasets that I have been exposed to over the years).
>
> Luckily, Xerces already handles BOMs in UTF-8 files, so there is no
> compatibility issue with reading its own generated files.
>
> My suggestion is to allow Xerces to generate a BOM for a UTF-8 encoded
> file if it is explicitly asked to do so through the serializer
> (DOMWriter) by setting the XMLUni::fgDOMWRTDOM feature. Most people
> won't set this feature, so by default generated UTF-8 files would still
> not contain the BOM, but with this change adding a BOM to generated
> UTF-8 files would become an option for those who do want it.
>
> Since the Xerces code is well written, the code modifications needed to
> accommodate this change would be quite small.
>
> I can make the changes and submit them as a patch request, but first I
> would like to generate a discussion about this topic to help determine
> what the best implementation should be. I'd ask that a pragmatic and
> realistic viewpoint, rather than a hard-line spec viewpoint, be adopted,
> since BOMs in UTF-8 encoded files are out there and will not be going
> away.
>
> Thank you,
> _Nicholas
>

--
www.savedbycuriosity.com
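
For anyone following the thread, the byte-level behaviour being discussed can be sketched in a few lines of plain C++. This is only an illustration under stated assumptions: serializeUtf8 and stripUtf8Bom are hypothetical helpers written for this note, not part of the Xerces API, and the sketch does not show how the DOMWriter feature would actually be wired up. It shows the opt-in default (no BOM unless asked for) and what a BOM-tolerant reader needs to do with the three bytes 0xEF 0xBB 0xBF.

```cpp
#include <string>

// The UTF-8 encoding of U+FEFF, i.e. the 3-byte BOM discussed above.
static const std::string kUtf8Bom = "\xEF\xBB\xBF";

// Hypothetical helper (not a Xerces API): return the bytes to write for an
// already UTF-8 encoded document. The BOM is prefixed only when the caller
// asks for it, so the default output is unchanged.
std::string serializeUtf8(const std::string& xml, bool writeBom = false) {
    return writeBom ? kUtf8Bom + xml : xml;
}

// Hypothetical helper (not a Xerces API): what a BOM-tolerant reader does.
// A leading BOM is dropped; input without a BOM is returned untouched.
std::string stripUtf8Bom(const std::string& bytes) {
    if (bytes.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0) {
        return bytes.substr(kUtf8Bom.size());
    }
    return bytes;
}
```

A receiver that calls something like stripUtf8Bom before parsing accepts both forms, which is the interoperability case Keith raises. A receiver that does not would see three unexpected bytes before the XML declaration, which is the argument for keeping the BOM off by default.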
