Re: Strange characters when displaying html files saved in UTF-8 format

Brian Stell Tue, 11 Dec 2001 13:23:36 -0800

Jarkko Hietaniemi wrote:
> 
> On Tue, Dec 11, 2001 at 11:34:09AM -0800, Brian Stell wrote:
> > Jalal,
> >
> > Kindly reply via the mailing list so others can see the discussion.
> > That way others can benefit and/or help.
> >
> > BOM is the Byte Order Mark used in Unicode to indicate an
> > important detail about the Unicode data stream.
> >
> > Perhaps the Perl people can describe how to inhibit the BOM?
> 
> I don't think it's Perl putting the BOM in there.
> 
> I opened up Notepad in Win2000, wrote "foobar", and saved the file
> as "ANSI", "UTF-8", "Unicode", and "Unicode big endian".  Then in UNIX
> with this
> 
>   perl -e 'print "$ARGV[0]: "; print unpack "H*", <>; print "\n"' file.name
> 
> I get
> 
> foo.ansi: feff0066006f006f006200610072000d000a
> foo.utf8: efbbbf666f6f6261720d0a
> foo.unic: fffe66006f006f006200610072000d000a
> foo.unib: feff0066006f006f006200610072000d000a
> 
> (copied by hand, so typos possible) which looks like big-endian
> UTF-16, UTF-8, little-endian UTF-16, and (again) big-endian UTF-16
> to me.  For example the "Unicode" is first the BOM, then the 0x66
> aka "f", then two 0x6f:s, aka "o", then 0x62, aka "b", and so on.
> 
> No Perl was involved in creating these files, but the BOMs are there
> (the UTF-8 0xEF 0xBB 0xBF is the BOM in disguise).
> 
> Moreover, if the browser claims to do Unicode, it should recognize the
> BOM, too, and ignore it in display (but of course use it to figure out
> the right endianness).

The BOM is valid as the *first* character. I'm not sure what the
spec says about subsequent chars.
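
As a rough illustration of the "first character" rule, here is a minimal
sketch (in Python, not Perl) of how a reader could look for a BOM at the
very start of the data, use it to pick the encoding, and drop it before
display. The names sniff_bom and decode_without_bom are hypothetical, and
the cp1252 fallback for BOM-less input is only an assumption standing in
for Notepad's "ANSI"; this is not how any particular browser does it.

```python
import codecs

# Only the three BOMs discussed in this thread are checked.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if the data has no leading BOM."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0

def decode_without_bom(data: bytes, fallback: str = "cp1252") -> str:
    """Decode data, skipping a leading BOM; the fallback encoding is an assumption."""
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)

# The foo.utf8 bytes quoted above: BOM, "foobar", CR LF.
sample = bytes.fromhex("efbbbf666f6f6261720d0a")
print(sniff_bom(sample))                 # ('utf-8', 3)
print(repr(decode_without_bom(sample)))  # 'foobar\r\n'
```

Only the start of the data is inspected, so a U+FEFF that appears later
in the stream is left untouched by this sketch.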

How did the browsers handle the foo.* files?

Of course you may need to manually set the encoding to get
proper results, since these files do not have a charset tag. I do
believe that the Netscape 6.2 universal autodetector should detect
it automatically (when turned on).

-- 
Brian Stell