At 10:58 PM -0400 5/17/01, [EMAIL PROTECTED] wrote:
>The "UTF-8 signature" discussion appears every few months on this list,
>usually as a religious debate between those who believe in it and those who
>do not.  Be forewarned, my religion may not match yours.  :-)

My religion suggests that we find common ground and not engage in rwars.

>Keld Jørn Simonsen wrote:
>
>>  For UTF-8 there is no need to have a BOM, as there is only one
>>  way of serializing octets in UTF-8. There is no little-endian
>>  or big-endian. A BOM is superfluous and will be ignored.

You could say "should be ignored", but you can't speak for everybody 
else's software.

>The debate is not about whether byte order needs to be specified in a UTF-8
>file (of course it doesn't) but whether U+FEFF should be used as a signature
>to identify the file as UTF-8, rather than some other byte-oriented encoding.

Which will only work if the software is ready to handle it.

>Martin Dürst wrote:
>
>>  There is about 5% of a justification
>>  for having a 'signature' on a plain-text, standalone file (the reason
>>  being that it's somewhat easier to detect that the file is UTF-8 from the
>>  signature than to read through the file and check the byte patterns
>>  (which is an extremely good method to distinguish UTF-8 from everything
>>  else)).
[snip]
OK, that's enough context.

Last year, and the year before that, we discussed the 
possibility of defining some standard Unicode plain text formats. The 
discussions foundered on the differences between text files meant for 
people to read, such as e-mail, FAQs, and so on, and text files meant 
for computers to process, such as delimited data files. We could not 
agree, for example, whether a limit on line length was to be 
required, permitted, or forbidden. We could not even agree that the 
rules would be different for different cases, and that we would 
attempt to enumerate the cases our standard would cover.

This BOM-as-signature debate is of the same type. Is it to be 
required, permitted, forbidden, or something else? The short answer 
is that no single answer will do. Users do not agree, and software 
cannot be made to agree, not even if a formal standard were created 
and widely used.

Martin knows of no actual cases where a non-UTF-8 file could be 
mistaken for UTF-8, so he says the signature is unnecessary, and goes 
on to say that it is actually harmful. Specifically, he asks how all 
Unix text-handling software could be made to work with a signature. 
It can't all be changed, but here is a possible method for coping.

Create a filter that strips an initial signature from a text stream, 
and passes the remainder through unchanged. You can be picky and make 
it verify that the stream is in UTF-8, if you like.
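
As a rough illustration of the stripping filter, here is a minimal 
sketch in Python. The names strip_signature and verify are my own 
inventions for the example, not anything from an existing tool, and 
the sketch assumes the whole stream fits in memory.

```python
UTF8_SIGNATURE = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def strip_signature(data: bytes, verify: bool = False) -> bytes:
    """Return data without a leading UTF-8 signature.

    Everything after the signature is passed through unchanged.
    """
    if data.startswith(UTF8_SIGNATURE):
        data = data[len(UTF8_SIGNATURE):]
    if verify:
        # The "picky" option: raises UnicodeDecodeError if the
        # remainder is not well-formed UTF-8.
        data.decode("utf-8")
    return data
```

To use it as a real filter you would read sys.stdin.buffer, pass the 
bytes through strip_signature, and write the result to 
sys.stdout.buffer, so that it can sit in a pipeline ahead of sort, 
grep, and the rest.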

Create a filter that adds a signature to the beginning of a text 
stream, if it does not already have one. You can be picky, again.
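
The adding filter could be sketched in Python along these lines. 
Again, add_signature and verify are hypothetical names chosen for the 
example, and the sketch assumes the stream fits in memory.

```python
UTF8_SIGNATURE = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def add_signature(data: bytes, verify: bool = False) -> bytes:
    """Return data with exactly one leading UTF-8 signature."""
    if verify:
        # The "picky" option: raises UnicodeDecodeError if the
        # input is not well-formed UTF-8.
        data.decode("utf-8")
    if data.startswith(UTF8_SIGNATURE):
        return data  # already signed, leave it alone
    return UTF8_SIGNATURE + data
```

Applying it twice gives the same result as applying it once, which is 
what you want from a filter that may be run on files of unknown 
history.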

Create a filter that can identify character sets heuristically and 
convert them to UTF-8.
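
A full heuristic detector is a large project, but a deliberately crude 
sketch in Python shows the shape of it. It assumes only three cases: 
UTF-16 with a byte order mark, well-formed UTF-8 (with or without a 
signature), and a single legacy encoding named by the caller. The 
function name to_utf8 and the fallback parameter are my own 
inventions, and the choice of ISO 8859-1 as the default is an 
assumption, not a recommendation.

```python
import codecs
from typing import Tuple


def to_utf8(data: bytes, fallback: str = "iso-8859-1") -> Tuple[bytes, str]:
    """Guess the encoding of data and convert it to unsigned UTF-8.

    Returns (utf8_bytes, guessed_encoding).
    """
    # 1. A UTF-16 byte order mark is taken at face value.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16").encode("utf-8"), "utf-16"
    # 2. Check the byte patterns: if the whole stream decodes as
    #    UTF-8, call it UTF-8. "utf-8-sig" also drops a signature.
    try:
        return data.decode("utf-8-sig").encode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Otherwise assume the caller's legacy encoding.
    return data.decode(fallback).encode("utf-8"), fallback
```

The output never carries a signature, so it can be fed straight to 
utilities that do not expect one. A serious version would have to 
tell the legacy encodings apart, which this sketch does not attempt.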

Write your scripts carefully, so that you know when you are handling 
text in unknown character sets, and apply these filters as needed.

Then ordinary Unix utilities will be fed data that they will not 
choke on, in known encodings without extraneous non-text data.

In all other contexts, such as XML, if the standard allows for a 
signature, fine, and if not, don't use one. If there is no standard, 
you have to negotiate a private agreement if you want to send people 
something out of the ordinary.


Another way to look at the matter is to say that plain text is plain, 
and a signature is markup. Then a text file with a signature is, if 
not rich text, at least above the poverty line.
-- 

Edward Cherlin
Generalist
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland
