The "UTF-8 signature" discussion appears every few months on this list, 
usually as a religious debate between those who believe in it and those who 
do not.  Be forewarned, my religion may not match yours.  :-)

Keld Jørn Simonsen wrote:

> For UTF-8 there is no need to have a BOM, as there is only one
> way of serializing octets in UTF-8. There is no little-endian
> or big-endian. A BOM is superfluous and will be ignored.

The debate is not about whether byte order needs to be specified in a UTF-8 
file (of course it doesn't) but whether U+FEFF should be used as a signature 
to identify the file as UTF-8, rather than some other byte-oriented encoding.
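
For concreteness, the signature in question is simply U+FEFF serialized in 
UTF-8, which comes out as the three bytes EF BB BF.  A quick Python 
illustration:

    signature = "\ufeff".encode("utf-8")   # U+FEFF serialized as UTF-8
    print(signature)                       # b'\xef\xbb\xbf'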

Martin Dürst wrote:

> There is about 5% of a justification
> for having a 'signature' on a plain-text, standalone file (the reason
> being that it's somewhat easier to detect that the file is UTF-8 from the
> signature than to read through the file and check the byte patterns
> (which is an extremely good method to distinguish UTF-8 from everything
> else)).

A plain-text file is more in need of such a signature than any other type of 
file.  It is true that "fancy" text such as HTML or XML, which already has a 
mechanism to indicate the character encoding, doesn't need a signature, but 
this is not necessarily true of plain-text files, which will continue to 
exist for a long time to come.

The strategy of checking byte patterns to detect UTF-8 is usually accurate, 
but may require that the entire file be checked instead of just the first 
three bytes.  In his September 1997 presentation in San Jose, Martin conceded 
that "Because probability to detect UTF-8 [without a signature] is high, but 
not 100%, this is a heuristic method" and then spent several pages evaluating 
and refining the heuristics.  Using a signature is not just somewhat easier; 
it is *much* easier.
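
To make the difference concrete, here is a minimal Python sketch of the two 
approaches (the function names are mine, and the heuristic leans on a strict 
UTF-8 decoder rather than Martin's refined tables):

    def has_utf8_signature(data: bytes) -> bool:
        # Trivial: look at the first three bytes only.
        return data[:3] == b"\xef\xbb\xbf"

    def looks_like_utf8(data: bytes) -> bool:
        # Heuristic: scan the *entire* buffer for well-formed UTF-8.
        # A file in some other encoding can occasionally decode
        # cleanly too, so the answer is probabilistic, not certain.
        try:
            data.decode("utf-8")
            return True
        except UnicodeDecodeError:
            return False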

> - When producing UTF-8 files/documents, *never* produce a 'signature'.
>    There are quite some receivers that cannot deal with it, or that deal
>    with it by displaying something. And there are many other problems.

If U+FEFF is not interpreted as a BOM or signature, then by process of 
elimination it should be interpreted as a zero-width no-break space (ZWNBSP; 
more on this later).  Any receiver that deals with a ZWNBSP by displaying a 
visible glyph is not very smart about the way it handles Unicode text, and 
should not be the deciding factor in how that text is encoded.

What are the "many other problems"?  Does this comment refer to programs and 
protocols that require their own signatures as the first few bytes of an 
input file (like shell scripts)?  The Unicode Standard 3.0 explicitly states 
on page 325, "Systems that use the byte order mark must recognize that an 
initial U+FEFF signals the byte order; it is not part of the textual 
content."  Programs that go bonkers when handed a BOM need to be corrected to 
conform to the intent of the UTC.
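
A conformant reader, in other words, strips an initial U+FEFF before handing 
the text to the application.  A minimal Python sketch of that behavior:

    def read_utf8(path: str) -> str:
        with open(path, "rb") as f:
            data = f.read()
        # Per TUS 3.0, p. 325: an initial U+FEFF is a signature,
        # not part of the textual content, so drop it.
        if data[:3] == b"\xef\xbb\xbf":
            data = data[3:]
        return data.decode("utf-8")

(Python's "utf-8-sig" codec bundles exactly this behavior, incidentally.)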

> For XML, the 'signature' is now listed in App F.1
> http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info
> But this is not normative, and fairly recent, and so you should never
> expect an XML processor to accept it (except as a plain character
> in the file when there is no XML declaration).

Everything about XML is "fairly recent."  And again, current versions of 
applications that are slightly broken in their handling of legitimate Unicode 
characters should not dictate the way Unicode is to be used.

In the C9X charter, the base document for the revision of the C programming 
language, the #1 guiding principle was "Existing code is important, existing 
implementations are not."  In the Unicode context, "code" is textual Unicode 
data and "implementations" are browsers, XML processors, and such.  
Implementations will in time be upgraded to provide better Unicode support.  
The techniques used to encode text in UTF-8 should not depend on the 
imperfections of today's implementations.

Back on 2000-06-22, in the thread "Re: UTF-8N?", Ken Whistler pointed out 
that the real problem with the UTF-8 signature/BOM was that its functionality 
had been "bizarrely unified" with that of the ZWNBSP, and noted that the new 
character U+2060 WORD JOINER would be introduced in Unicode 3.2 to take over 
the ZWNBSP duties from U+FEFF.  Indeed, the proposed Unicode 3.2 code chart 
(available at http://www.unicode.org/charts/draftunicode32/U32-2000.pdf) 
describes the WORD JOINER explicitly as "intended for disambiguation of 
functions for BOM."
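
In code, the intended division of labor looks something like this (a sketch, 
assuming an implementation in which U+2060 is already available):

    BOM         = "\ufeff"  # U+FEFF: reserved for signature/byte order duty
    WORD_JOINER = "\u2060"  # U+2060: takes over the old ZWNBSP duty

    # Glue two units together with no line break permitted between them:
    glued = "Unicode" + WORD_JOINER + "3.2"

    # U+FEFF then appears only once, at the very front, as a signature:
    data = (BOM + glued).encode("utf-8")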

What all this means is that the UTC is committed to preserving the utility of 
U+FEFF as a byte order mark, and by extension a signature.  As Marco 
Cimarosti observed, the FAQ on the Unicode Web site describes the use of the 
BOM as a signature for "otherwise unmarked" UTF-8 text files, without once 
deprecating or discouraging that usage.

The possibility of confusion over interpreting an initial U+FEFF as BOM or 
ZWNBSP absolutely should NOT be a justification for discouraging the BOM.  
The sole purpose of a zero-width no-break space -- regardless of where or how 
encoded -- is to divide two lexical units logically without rendering a 
visible space or line break.  When would such a character ever be appropriate 
as the first character of a text stream?  What would it divide?

"But what about a process that breaks a text stream into chunks and, say, 
transmits the chunks down a wire?  You can't depend on the meaning of an 
'initial' U+FEFF then."  That's true, but any process that deals with data in 
this manner should not be interpreting or modifying it anyway.  Imagine the 
damage that could be caused to CR/LF pairs that were inadvertently separated 
into two chunks.
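
A contrived Python sketch, with the chunk size chosen deliberately to force 
the split:

    stream = "line one\r\nline two".encode("utf-8")

    # A naive 9-byte chunker splits the CR/LF pair across chunks:
    chunks = [stream[i:i + 9] for i in range(0, len(stream), 9)]
    # chunks[0] ends with b'\r' and chunks[1] begins with b'\n', so any
    # per-chunk line-ending "fixer" would corrupt the stream -- just as
    # a per-chunk BOM-stripper would misjudge a U+FEFF that happened to
    # land at a chunk boundary.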

Ken wrote in his 2000-06-22 message, "If you can at all help it, start 
refraining now from using U+FEFF as a zero-width non-breaking space," and I 
seriously doubt that many applications have been doing so anyway, compared to 
the number that use U+FEFF as a signature or byte order mark.

I believe there is a common thread between this topic and the topic of Plane 
14 tags (although I have pretty much conceded defeat on that one).  On one 
side are those who believe a certain, limited amount of metadata is 
appropriate in plain-text files; on the other are those who believe that all 
metadata should reside in a higher-level format, or perhaps that plain-text 
files are irrelevant in the 21st century.  In the case of UTF-8 signatures, I 
hope there is some popular support for the notion that the U+FEFF signature 
is more beneficial than harmful.

-Doug Ewell
 Fullerton, California
