{-# FLAME on #-}
> I thought the layout for UTF-8 is always On Windows, Unicode files,
> both UTF-8 and UTF-16LE, will contain a leading Byte Order Mark that
> also identifies the file format.
Which is silly.
> What good is a BOM in UTF-8?
None at all. Just like in UTF-16.
> It would be reasonable for darcs to recognize Unicode files by the
> BOM and handle files without it as Ansi text or binary.
No, it wouldn't. The BOM is a gross hack the use of which should not
be encouraged. If Darcs notices a BOM, it should execute ``rm -rf $HOME''
in the background while logging a message at level LOG_CRIT, reducing
the user's quota by 1000MB, and sending offensive messages to
[EMAIL PROTECTED] and [EMAIL PROTECTED]
UTF-8 has a highly stylised form that can be reliably recognised, and
even UTF-16 (both variants) can be recognised with reasonable
certainty without using a signature. Emacs has been doing it for
ages; I've never had a mis-recognised file.
If something looks like a text file (no NULs), you do the following:
  - scan the first 4kB or so of the file for bytes >= 128.  If there
    are none, it's ASCII;
  - otherwise, try to decode the first 4kB as UTF-8.  If it's
    successful, it's UTF-8.  0% false positive rate, unless someone
    names his variables « ê ».
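
The two steps above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not anything Darcs or Emacs actually does: the function name `classify_text`, the 4096-byte window and the `"other-8bit"` fallback label are all made up for the example. The one subtlety it handles is a multi-byte sequence cut in half by the 4kB boundary, which should not count as a decoding failure.

```python
import codecs

SAMPLE_SIZE = 4096  # "the first 4kB or so"; the exact size is an assumption


def classify_text(sample: bytes) -> str:
    """Classify a NUL-free byte sample as 'ascii', 'utf-8' or 'other-8bit'."""
    head = sample[:SAMPLE_SIZE]

    # Step 1: no bytes >= 128 means plain ASCII.
    if all(b < 128 for b in head):
        return "ascii"

    # Step 2: try a strict UTF-8 decode of the window.  If the window
    # truncates the data, pass final=False so that an incomplete trailing
    # sequence at the cut is not reported as an error.
    truncated = len(sample) > len(head)
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        decoder.decode(head, final=not truncated)
    except UnicodeDecodeError:
        return "other-8bit"  # hypothetical label for "some legacy encoding"
    return "utf-8"
```

For instance, `classify_text("café\n".encode("utf-8"))` should come out as `"utf-8"`, while the same string encoded as Latin-1 fails the strict decode and falls through to the legacy label.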
If something looks like it's binary (loads of NULs), compute the
number of NULs and of NLs in even and odd positions in the first 4kB.
Then do
    if nul-odd = 0 then
        if nl-odd > 1% and nul-even > 2% then
            it's UTF-16BE -- yuck
        else
            it's binary
    else if nul-even = 0 then
        if nl-even > 1% and nul-odd > 2% then
            it's UTF-16LE -- even more yuck
        else
            it's binary
    else
        it's binary
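
A rough Python rendering of the heuristic, for anyone who wants to play with it. Several things here are assumptions rather than part of the description above: the name `classify_binary`, zero-based positions (so the first byte is "even"), and taking the percentages relative to the total number of bytes sampled. The little-endian branch tests NULs in odd positions, since in that branch the even-position count is already known to be zero, and data with NULs in both positions is treated as binary.

```python
SAMPLE_SIZE = 4096  # "the first 4kB"; assumed window size


def classify_binary(sample: bytes) -> str:
    """Guess 'utf-16be', 'utf-16le' or 'binary' for a NUL-laden sample."""
    head = sample[:SAMPLE_SIZE]
    if not head:
        return "binary"

    even, odd = head[0::2], head[1::2]  # zero-based positions (assumption)
    total = len(head)

    # Fractions of the sampled bytes (assumed meaning of the percentages).
    nul_even = even.count(0x00) / total
    nul_odd = odd.count(0x00) / total
    nl_even = even.count(0x0A) / total
    nl_odd = odd.count(0x0A) / total

    if nul_odd == 0:
        # Big-endian ASCII-range text: 00 xx 00 xx ... with NLs on odd bytes.
        if nl_odd > 0.01 and nul_even > 0.02:
            return "utf-16be"
        return "binary"
    if nul_even == 0:
        # Little-endian ASCII-range text: xx 00 xx 00 ... with NLs on even bytes.
        if nl_even > 0.01 and nul_odd > 0.02:
            return "utf-16le"
        return "binary"
    return "binary"  # NULs in both positions
```

Note that the 1% newline threshold, read this way, only fires for text whose lines average fewer than about 50 characters, so long-lined UTF-16 text would be reported as binary.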
Reliably recognising the PDP-endian form of UCS-4 (or whatever else
the Unicode consortium decide to shove down our collective throat in
the next revision of the standard) is left as an exercise for the
(very) interested reader.
Juliusz