{-# FLAME on #-}

> I thought the layout for UTF-8 is always [...]
>
> On Windows, Unicode files, both UTF-8 and UTF-16LE, will contain a
> leading Byte Order Mark that also identifies the file format.

Which is silly.

> What good is a BOM in UTF-8?

None at all.  Just like in UTF-16.

> It would be reasonable for darcs to recognize Unicode files by the
> BOM and handle files without it as Ansi text or binary.

No, it wouldn't.  The BOM is a gross hack the use of which should not
be encouraged.  If Darcs notices a BOM, it should execute ``rm -rf $HOME''
in the background while logging a message at level LOG_CRIT, reducing
the user's quota by 1000MB, and sending offensive messages to
[EMAIL PROTECTED] and [EMAIL PROTECTED]

UTF-8 has a highly stylised form that can be reliably recognised, and
even UTF-16 (both variants) can be recognised with reasonable
certainty without using a signature.  Emacs has been doing it for
ages; I've never had a mis-recognised file.

If something looks like a text file (no NULs), you do the following:

  - scan the first 4kB or so of the file for bytes >= 128.  If there
    are none, it's ASCII;
  - otherwise, try to decode the first 4kB as UTF-8.  If it's
    successful, it's UTF-8.  0% false positive rate, unless someone
    names his variables « ê ».
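For the curious, the text-file steps above can be sketched in a few lines
of Python.  This is only an illustration, not what Emacs or Darcs
actually do: the function name, the return labels and the exact 4096-byte
sample size are my own assumptions.  The incremental decoder is used so
that a multi-byte sequence cut in half by the 4kB boundary is not
mistaken for invalid UTF-8.

```python
import codecs

SAMPLE_SIZE = 4096  # "the first 4kB or so" (exact value is an assumption)


def sniff_text(data: bytes) -> str:
    """Classify a byte string as 'ascii', 'utf-8', 'unknown-8bit' or 'not-text'.

    Hypothetical helper: the name and the labels are illustrative only.
    """
    sample = data[:SAMPLE_SIZE]

    # Any NUL means this does not look like a text file; hand it to the
    # binary/UTF-16 heuristic instead.
    if b"\x00" in sample:
        return "not-text"

    # No byte >= 128 in the sample: plain ASCII.
    if max(sample, default=0) < 128:
        return "ascii"

    # Otherwise try to decode the sample as UTF-8.  final=False tolerates a
    # sequence truncated by the sample boundary; if the sample is the whole
    # file, a truncated sequence is a genuine error.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(sample, final=len(data) <= SAMPLE_SIZE)
    except UnicodeDecodeError:
        return "unknown-8bit"
    return "utf-8"
```

Latin-1 text such as "été" fails the decode and is reported as
unknown-8bit.  The false positives are the rare legacy strings whose
bytes happen to form valid UTF-8.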

If something looks like it's binary (loads of NULs), compute the
number of NULs and of NLs in even and odd positions in the first 4kB.
Then do

  if nul-odd = 0 then
      if nl-odd > 1% and nul-even > 2% then
          it's UTF-16BE           -- yuck
      else
          it's binary
  else if nul-even = 0 then
      if nl-even > 1% and nul-odd > 2% then
          it's UTF-16LE           -- even more yuck
      else
          it's binary
  else
      it's binary
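
The same decision table as a rough Python sketch.  The assumptions are
mine: positions are counted from zero, each percentage is taken relative
to the number of bytes at that parity in the sample, and the function
name and labels are made up for illustration.  In the little-endian
branch the sketch tests NULs at odd positions, since NULs at even
positions are already known to be absent there.

```python
SAMPLE_SIZE = 4096  # "the first 4kB" (exact value is an assumption)


def sniff_utf16(data: bytes) -> str:
    """Return 'utf-16be', 'utf-16le' or 'binary' for NUL-laden data.

    Hypothetical helper: the name and the labels are illustrative only.
    """
    sample = data[:SAMPLE_SIZE]
    even, odd = sample[0::2], sample[1::2]  # zero-based positions
    if not even or not odd:
        return "binary"

    def frac(chunk: bytes, byte: bytes) -> float:
        # Fraction of the bytes in `chunk` that are equal to `byte`.
        return chunk.count(byte) / len(chunk)

    nul_even, nul_odd = frac(even, b"\x00"), frac(odd, b"\x00")
    nl_even, nl_odd = frac(even, b"\n"), frac(odd, b"\n")

    if nul_odd == 0:
        # Big-endian ASCII-ish text: NUL high byte first, newline second.
        if nl_odd > 0.01 and nul_even > 0.02:
            return "utf-16be"
        return "binary"
    if nul_even == 0:
        # Little-endian: newline in the first byte, NUL high byte second.
        if nl_even > 0.01 and nul_odd > 0.02:
            return "utf-16le"
        return "binary"
    # NULs at both parities: not something we recognise.
    return "binary"
```

Feeding it "hello\nworld\n" encoded with Python's BOM-less utf-16-be and
utf-16-le codecs exercises both branches without any signature in sight.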

Reliably recognising the PDP-endian form of UCS-4 (or whatever else
the Unicode consortium decide to shove down our collective throat in
the next revision of the standard) is left as an exercise for the
(very) interested reader.

                                        Juliusz
_______________________________________________
darcs-devel mailing list
[email protected]
http://lists.osuosl.org/mailman/listinfo/darcs-devel
