Hi,

Just an update.  I did not have a released package to check for Byte
Order Marks (BOMs) when I wrote my previous message for this bug.
Today, however, I uploaded what I think is a reasonably working
version of my package utfcheck, which checks an input file to see if
it is valid UTF-8.  Among other things it will look for and report
BOMs anywhere in an input file.  You can see it here:

     https://tracker.debian.org/pkg/utfcheck

Start with the latest uploaded version, utfcheck-1.2-1.  That has
additional information in the man page that was not in previous
versions.  It prints a summary after reaching end of file on its
input, and will even report BOMs that are embedded in the middle of a
file--something that a user could easily miss with less.
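
To illustrate the kind of scan involved, here is a minimal sketch in
Python.  It is not utfcheck's actual code; the name find_utf8_boms is
made up for this example, and it assumes the input is otherwise valid
UTF-8, so that the byte sequence EF BB BF can only be U+FEFF.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded as UTF-8


def find_utf8_boms(data: bytes) -> list:
    """Return the byte offset of every UTF-8 BOM found in data.

    Hypothetical helper for illustration only.  Assumes data is
    otherwise valid UTF-8.
    """
    offsets = []
    pos = data.find(UTF8_BOM)
    while pos != -1:
        offsets.append(pos)
        # Resume the search just past the BOM that was found.
        pos = data.find(UTF8_BOM, pos + len(UTF8_BOM))
    return offsets


if __name__ == "__main__":
    sample = b"\xef\xbb\xbfabc\xef\xbb\xbfdef"
    for offset in find_utf8_boms(sample):
        kind = "leading" if offset == 0 else "embedded"
        print(f"{kind} BOM at byte offset {offset}")
```

An offset of 0 is the usual signature at the start of a file; any
other offset is an embedded BOM of the sort that is easy to miss on
screen.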

As for the BOM being in text files, programs need to handle that
condition correctly.  As I mentioned in my last message, Table 2.4 of
the Unicode Standard has specifically listed a UTF-8 BOM as legal
Unicode from version 5.0 onwards; before then, in version 4.0, that
table was Table 2.3.  Someone mentioned elsewhere that LibreOffice
includes the BOM in text files, so if nothing else, a program that
reads UTF-8 input from a text file that LibreOffice creates cannot
escape it.
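
As one example of what handling it can look like on the reading side,
here is a small Python sketch.  It relies on Python's standard
"utf-8-sig" codec, which decodes UTF-8 and drops a single leading BOM
if one is present; the function name decode_utf8_text is only an
assumption for this illustration, and other languages will need their
own equivalent.

```python
def decode_utf8_text(raw: bytes) -> str:
    """Decode UTF-8 bytes, discarding one leading BOM if present.

    Hypothetical helper for illustration.  Only a BOM at the very
    start is removed; a BOM later in the data is kept as U+FEFF.
    """
    return raw.decode("utf-8-sig")


if __name__ == "__main__":
    with_bom = b"\xef\xbb\xbfhello"
    without_bom = b"hello"
    # Both inputs decode to the same text.
    print(decode_utf8_text(with_bom) == decode_utf8_text(without_bom))
```

Note that this only deals with the signature at the start of the
input.  A BOM embedded later in the data still comes through as
U+FEFF and has to be handled separately.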

In version 3.0 of the Unicode Standard, Section 13.6 "Specials" on p.
324 states that "In UTF-8...this sequence can serve as a signature for
UTF-8 encoded text where the character set is unmarked."  That was
published in 2000, almost 20 years ago.

I could look it up in older versions of the Unicode Standard too (I
have them going back to version 1.0), but I hope there can be general
agreement that almost 20 years is long enough for application
software to conform.

In keeping with the original Unix philosophy that a program should do
one thing, and do it well, less should be able to render scrolling
text as it is meant to be displayed.  That principle does not by
itself resolve the UTF-8 BOM problem, though.  I hope that utfcheck
helps the situation.

Thanks,


Paul Hardy
