On Sun, 16 Feb 2003, Roozbeh Pournader wrote:
> I was thinking about the annoying BOM-like sequence that Windows 2000's
> and XP's Notepads are putting at the beginning of UTF-8 files. The byte
> sequence "EF BB BF" that's invalid as a header/signature in Unix UTF-8.
>
> Shouldn't 'dos2unix' be patched to also remove this sequence?

That would be useful. However, it doesn't work very well if multiple
files are fed to it (e.g. 'cat a b c | dos2unix'): only the first BOM
sits at the start of the stream, and the others end up in the middle of
the data. And that's why we all hate the UTF-8 BOM ;-).

How about these?

Incidentally, it just occurred to me that ftp/ssh clients could offer a
user-configurable option to automatically remove the 'UTF-8 BOM' at the
beginning of a UTF-8 text file when moving files from Windows to
non-Windows platforms (Unix/Unix-like OSes and MacOS). The same is true
of Kermit (Frank, are you here?). All those tools can already be
configured to translate between three (and nowadays even more?) EOL
conventions, CR/LF/CRLF, for text files. The automatic removal (and
addition, if that's regarded as necessary) of the UTF-8 BOM at platform
boundaries would be just as useful.

As for web servers, a configurable option could be added to remove the
UTF-8 BOM at the beginning of the text/* files they serve. For instance,
it's easy to write a simple module for Apache (used at the Unicode.org
web site) to do that.

The VFAT, NTFS and FAT filesystem drivers for Linux could be modified in
a similar way. And editors like Vim (which automatically detects the EOL
convention used in text files) could do the same.

Jungshik

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
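
To illustrate the 'cat a b c | dos2unix' point above, here is a minimal
Python sketch. It is not how dos2unix works and not a proposed patch;
the helper name strip_leading_bom and the three sample "files" are made
up for the illustration. It only assumes that a filter can recognize a
BOM at offset 0 of whatever stream it is given.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def strip_leading_bom(data: bytes) -> bytes:
    """Drop a UTF-8 BOM at offset 0; leave any other bytes untouched."""
    return data[len(BOM):] if data.startswith(BOM) else data


# Hypothetical contents of three files saved by Notepad (BOM + CRLF text).
a = BOM + b"one\r\n"
b = BOM + b"two\r\n"
c = BOM + b"three\r\n"

# Filtering each file separately removes every BOM.
per_file = b"".join(strip_leading_bom(f) for f in (a, b, c))

# Filtering the concatenation (what 'cat a b c | filter' sees) removes
# only the first one; the BOMs of b and c stay inside the stream.
piped = strip_leading_bom(a + b + c)

print(per_file.count(BOM), piped.count(BOM))
```

A filter that removed EF BB BF anywhere in the stream would catch the
remaining ones, but it could no longer tell a stray signature from a
U+FEFF that the author put there on purpose, which is why doing the
removal per file, at the platform boundary, looks more attractive.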
