On 2/18/03 2:45 AM, Mark Crispin at <[EMAIL PROTECTED]> wrote:
 
> That's correct, and that pretty much shoots down the use of a test to
> determine if something is UTF-8.  The test can prove that text is not
> UTF-8 (assuming that the UTF-8 wasn't somehow damaged in transit), but it
> does not reliably prove that text is UTF-8.

No, it doesn't.  Let me round the numbers off to the nearest percent, and
illustrate.

*********************************************************************
There are basically two questions.
    [Q1] Does it hurt to do the test?
    [Q2] Is it beneficial to do the test?
    
Now, given 100,000 articles, and rounding off to the nearest half percent,
you get:

   100,000  : Total Articles
    100.0%  : Articles correctly identified as not UTF8 (assume "local")
      0.0%  : Articles incorrectly identified as UTF8
      0.0%  : Articles correctly identified as UTF8

So for the answers (labeled A1 and A2), you get:

    [A1] Does it hurt?  Not at all.  For all practical purposes, the user
never encounters/sees an article displayed as UTF8 when it's not.

    [A2] Does it help?  Well, let's see, the percentage for correctly
identified is exactly the same as for incorrectly identified.  Is that a
problem? NO! BECAUSE THE PERCENTAGE FOR BOTH IS ZERO!  So, I guess it
doesn't help.
*********************************************************************

That's the situation as it stands today.  But we hope to change A2 by
encouraging people to switch over from the "local" charset to UTF8.  *If*
that happens, then it *will* help, because there will actually *be* articles
identified as UTF8.  Since the percentage of "incorrectly identified as
UTF8" will NOT go up, but instead go down, coming ever closer to being zero
and not just rounded down to zero, this will be an improvement.
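
For concreteness, the kind of test being argued about can be sketched in a
few lines of Python.  This is only an illustration, not code from any
newsreader: the name looks_like_utf8 is made up here, and treating pure
ASCII as "not UTF8" (because it says nothing either way) is an assumption.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """Hypothetical helper: True if raw is valid UTF-8 with non-ASCII bytes."""
    try:
        # Strict decoding is the default; malformed sequences raise.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # A failed decode proves the text is NOT UTF-8.
        return False
    # Assumption: pure ASCII is valid UTF-8 but carries no evidence,
    # so it is left to the "local" charset.
    return any(b >= 0x80 for b in raw)


print(looks_like_utf8("café".encode("utf-8")))    # True: real UTF-8
print(looks_like_utf8("café".encode("latin-1")))  # False: proven not UTF-8
print(looks_like_utf8("Ã©".encode("latin-1")))    # True: a false positive
```

The last call is the "incorrectly identified as UTF8" case: "local" text
whose high-bit bytes happen to form a valid UTF8 sequence.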

If what we do does NOT encourage people to switch (or let us say doesn't
encourage even 1% of the people to switch), then we're no worse off than we
are today -- we have a standard that says do X and it is ignored.

NEVER does switching over to raw UTF8 make identifying the charset /harder/,
no matter how you juggle the numbers -- either UTF8 doesn't happen often
enough to bother with, or if it does then it'll be worthwhile (even a 0.1%
switch would mean that it'd be right 10x as often as it'd be wrong).
Basically, only today does it not make sense to do the test; any shift
towards UTF8 actually being used results in it being a good test.
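
The "10x" arithmetic can be checked with a short Python sketch.  Note the
assumptions: the false-positive rate of 0.01% of "local" articles is not
stated above, it is simply the value that makes the 10x figure come out,
and every genuine UTF8 article is assumed to pass the test (nothing
damaged in transit).  The function name is made up for illustration.

```python
def right_to_wrong_ratio(utf8_share: float, false_positive_rate: float) -> float:
    """Ratio of correctly to incorrectly identified UTF-8 articles.

    utf8_share: fraction of all articles that really are UTF-8.
    false_positive_rate: ASSUMED fraction of "local" articles that
        happen to pass the UTF-8 test anyway.
    """
    correct = utf8_share  # assumes every real UTF-8 article passes
    incorrect = (1.0 - utf8_share) * false_positive_rate
    if incorrect == 0.0:
        return float("inf")  # never wrong
    return correct / incorrect


# A 0.1% switch to UTF8 with an assumed 0.01% false-positive rate:
print(round(right_to_wrong_ratio(0.001, 0.0001), 2))  # 10.01
# Today's situation, with essentially nobody sending raw UTF8:
print(right_to_wrong_ratio(0.0, 0.0001))  # 0.0
```

Whatever false-positive rate you plug in, raising utf8_share only ever
raises the ratio, which is the point being made.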

-- 
J.B. Moreno
