Re: dos2unix and UTF-8 BOM

Jungshik Shin Mon, 17 Feb 2003 05:20:38 -0800


On Sun, 16 Feb 2003, Roozbeh Pournader wrote:


> I was thinking about the annoying BOM-like sequence that Windows 2000's
> and XP's Notepads are putting at the beginning of UTF-8 files. The byte
> sequence "EF BB BF" that's invalid as a header/signature in Unix UTF-8.
>
> Shouldn't 'dos2unix' be patched to also remove this sequence?

  That would be useful. However, it doesn't work very well if multiple
files are fed to it (e.g. 'cat a b c | dos2unix'), because only the BOM
of the first file is then at the start of the stream. And that's why
we all hate the UTF-8 BOM ;-).
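
  To illustrate the concatenation problem, here is a minimal Python
sketch. It is not dos2unix code; strip_leading_bom is a hypothetical
helper, and the two byte strings stand in for files saved by Notepad.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def strip_leading_bom(data: bytes) -> bytes:
    """Drop a UTF-8 BOM only if it is the very first thing in the data."""
    return data[len(BOM):] if data.startswith(BOM) else data


# Two stand-in "files", each starting with a BOM.
a = BOM + b"first\r\n"
b = BOM + b"second\r\n"

# Like 'cat a b | filter': the filter sees one stream, so only the
# first BOM sits at the start and the second one survives mid-stream.
piped = strip_leading_bom(a + b)

# Filtering each file separately removes both BOMs.
per_file = strip_leading_bom(a) + strip_leading_bom(b)

print(BOM in piped, BOM in per_file)
```

  A filter that works on a pipe would have to guess whether an
"EF BB BF" in the middle of the stream is a leftover signature or
a real U+FEFF character in the text.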

  How about these?

  Incidentally, it just occurred to me that ftp/ssh clients could offer a
user-configurable option for the automatic removal of the 'UTF-8 BOM' at
the beginning of a UTF-8 text file when moving files from Windows to
non-Windows platforms (Unix/Unix-like OS and MacOS). The same is true
of Kermit (Frank, are you here?). All those tools can be configured
to translate between three (and nowadays even more?) EOL conventions,
CR/LF/CR+LF, for text files. So the automatic removal (and addition, if
that's regarded as necessary) of the UTF-8 BOM at platform boundaries
would be just as useful.
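
  As a rough sketch of what such a text-mode transfer option might do
(in Python; to_unix_text, to_windows_text and their strip_bom/add_bom
switches are made-up names, not options of any existing ftp/ssh client
or of Kermit, and the old MacOS CR-only convention is left out):

```python
import codecs

BOM = codecs.BOM_UTF8  # EF BB BF


def to_unix_text(data: bytes, strip_bom: bool = True) -> bytes:
    """Windows -> Unix: optionally drop a leading BOM, then CR+LF -> LF."""
    if strip_bom and data.startswith(BOM):
        data = data[len(BOM):]
    return data.replace(b"\r\n", b"\n")


def to_windows_text(data: bytes, add_bom: bool = False) -> bytes:
    """Unix -> Windows: LF -> CR+LF, optionally prepend a BOM."""
    # Normalise first so existing CR+LF pairs are not doubled into CR CR LF.
    data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
    if add_bom and not data.startswith(BOM):
        data = BOM + data
    return data
```

  Both steps happen once per file at the platform boundary, so the
concatenation problem with pipes does not arise here.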

   As for web servers, a configurable option could be added to remove
the UTF-8 BOM at the beginning of the text/* files they serve. For
instance, it would be easy to write a simple module for Apache (used at
the Unicode.org web site) to do that.

   The VFAT, NTFS and FAT filesystem drivers for Linux could be modified
in a similar way. And editors like Vim (which automatically detects the
EOL convention used in text files) can do the same.
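
   For the editor case, the idea is to note the BOM and the EOL style
when a file is read, so that they can be hidden while editing and
restored on write. A simplified Python sketch follows; sniff_text is a
hypothetical helper and this is not Vim's actual detection logic.

```python
import codecs

BOM = codecs.BOM_UTF8  # EF BB BF


def sniff_text(data: bytes):
    """Return (text, has_bom, eol) for a UTF-8 encoded file image."""
    has_bom = data.startswith(BOM)
    body = data[len(BOM):] if has_bom else data
    # Very rough EOL guess: any CR+LF means "dos", a bare CR means "mac".
    if b"\r\n" in body:
        eol = "dos"
    elif b"\r" in body:
        eol = "mac"
    else:
        eol = "unix"
    return body.decode("utf-8"), has_bom, eol
```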

   Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
