Jarkko Hietaniemi wrote:
>
> On Tue, Dec 11, 2001 at 11:34:09AM -0800, Brian Stell wrote:
> > Jalal,
> >
> > Kindly reply via the mailing list so others can see the discussion.
> > That way others can benefit and/or help.
> >
> > BOM is the Byte Order Mark, used in Unicode to indicate the
> > byte order (endianness) of the Unicode data stream.
> >
> > Perhaps the Perl people can describe how to inhibit the BOM?
>
> I don't think it's Perl putting the BOM in there.
>
> I opened up Notepad in Win2000, wrote "foobar", and saved the file
> as "ANSI", "UTF-8", "Unicode", and "Unicode big endian". Then in UNIX
> with this
>
> perl -e 'print "$ARGV[0]: "; print unpack "H*", <>; print "\n"' file.name
>
> I get
>
> foo.ansi: feff0066006f006f006200610072000d000a
> foo.utf8: efbbbf666f6f6261720d0a
> foo.unic: fffe66006f006f006200610072000d000a
> foo.unib: feff0066006f006f006200610072000d000a
>
> (copied by hand, so typos possible) which looks like big-endian
> UTF-16, UTF-8, little-endian UTF-16, and (again) big-endian UTF-16
> to me. For example the "Unicode" file is first the BOM, then 0x66
> aka "f", then two 0x6f:s, aka "o", then 0x62, aka "b", and so on.
>
> No Perl was involved in creating these files, but the BOMs are there
> (the UTF-8 0xEF 0xBB 0xBF is the BOM in disguise).
>
> Moreover, if the browser claims to do Unicode, it should recognize the
> BOM, too, and ignore it in display (but of course use it to figure out
> the right endianness).

The BOM is valid as the *first* character. I'm not sure what the
spec says about subsequent chars.

How did the browsers handle the foo.* files?

Of course, you may need to set the encoding manually to get
proper results, since these files do not have a charset tag. I do
believe that the Netscape 6.2 universal autodetector should detect
it automatically (when turned on).
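
For anyone who wants to check files like these by script, here is a
minimal sketch in Python (not Perl, and not what any browser actually
does) of sniffing a leading BOM and dropping it before decoding. The
names sniff_bom and decode_without_bom are made up for illustration,
and the cp1252 fallback for BOM-less data is only my assumption about
what an "ANSI" file from a Western Windows machine contains.

```python
import codecs

# Known BOM byte sequences. UTF-32 is checked before UTF-16 because
# the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM leads the data."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_without_bom(data: bytes, fallback: str = "cp1252") -> str:
    """Decode data, using a leading BOM (if any) to pick the encoding.

    The BOM bytes themselves are skipped, so they never reach the text.
    The fallback encoding is an assumption, used only when no BOM is found.
    """
    encoding, skip = sniff_bom(data)
    return data[skip:].decode(encoding or fallback)
```

Fed the foo.utf8 bytes from the dump above, sniff_bom reports utf-8
with a 3-byte BOM, and decode_without_bom returns "foobar" followed by
CR LF with no BOM in the text. Note that this only looks at the start
of the data; it says nothing about a U+FEFF that turns up later.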
--
Brian Stell