Hello all,
I sent this same email to the c-dev list. Its content applies from both
a user and a dev (mods) perspective, so I'm posting to this list as
well.
-----------------
I realize that the UTF-8 spec does not require the 3-byte BOM
(0xEF 0xBB 0xBF) to be added to a UTF-8 encoded file, but some editors
(MS, vim, etc.) use this BOM when reading an XML file to determine its
encoding. The reality of the situation is that a number of UTF-8 files
do contain a BOM, and this seems to be becoming more prevalent over time
(at least in the XML datasets that I have been exposed to over the
years).
Luckily, Xerces already handles the BOM when reading UTF-8 files, so
there would be no compatibility issue with Xerces reading its own
generated files.
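
To make the reading side concrete, here is a minimal sketch of how a
reader can recognize the UTF-8 BOM and skip it. This is not the Xerces
code; utf8BomLength is a hypothetical helper name I made up for
illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not Xerces API): returns the number of leading bytes
// occupied by a UTF-8 BOM, i.e. 3 if the buffer starts with EF BB BF, else 0.
std::size_t utf8BomLength(const std::string& bytes)
{
    static const unsigned char kBom[3] = { 0xEF, 0xBB, 0xBF };
    if (bytes.size() < 3)
        return 0;
    for (std::size_t i = 0; i < 3; ++i) {
        if (static_cast<unsigned char>(bytes[i]) != kBom[i])
            return 0;
    }
    return 3;
}
```

A reader would start parsing at offset utf8BomLength(buffer), so files
with and without a BOM are handled the same way.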
My suggestion is to allow Xerces to generate a BOM for a UTF-8 encoded
file if it is explicitly asked to do so through the serializer
(DOMWriter) by setting the XMLUni::fgDOMWRTBOM feature. Most people
won't set this feature, so by default generated UTF-8 files would still
not contain a BOM, exactly as they do today. With this change, adding a
BOM to generated UTF-8 files would simply become an option for those who
do want it.
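
As a rough sketch of the behavior I am proposing (standard C++ only, not
the Xerces API): WriterOptions, writeUtf8Document and serializeUtf8 are
hypothetical names standing in for the serializer and its feature flag,
and the input is assumed to be UTF-8 encoded already.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical stand-in for the serializer feature; not the Xerces API.
struct WriterOptions {
    bool writeBom = false;  // off by default, so existing output is unchanged
};

// Writes `xml` (assumed to be UTF-8 encoded already) to `out`, prefixing the
// 3-byte UTF-8 BOM only when the caller explicitly asked for it.
void writeUtf8Document(std::ostream& out, const std::string& xml,
                       const WriterOptions& opts)
{
    if (opts.writeBom)
        out.write("\xEF\xBB\xBF", 3);
    out.write(xml.data(), static_cast<std::streamsize>(xml.size()));
}

// Convenience wrapper that returns the serialized bytes as a string.
std::string serializeUtf8(const std::string& xml, const WriterOptions& opts)
{
    std::ostringstream out;
    writeUtf8Document(out, xml, opts);
    return out.str();
}
```

With the default options the output is byte-for-byte what it is today;
only callers who set writeBom get the three extra leading bytes.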
Since the Xerces code is well written, the code modifications needed to
accommodate this change would be quite small.
I can make the changes and submit them as a patch, but first I would
like to start a discussion about this topic to help determine what the
best implementation should be. I'd ask that a pragmatic and realistic
viewpoint, rather than a hard-line spec viewpoint, be adopted, since
BOMs in UTF-8 encoded files are out there and will not be going away.
Thank you,
_Nicholas