> My point is your rating was based on an assumption that was totally
> incorrect: that the developer had made a reasonable effort to put the
> right gauges and levers in the right place. Do you make a similar
> assumption about the FDM? That it is approximately right? Is there much
> value in such a rating?

Vivian, I am sorry if I now take a somewhat lecturing tone - I do not
know how much you know about mathematical statistics, but I have the
impression you are completely missing the issue here.

What the rating represents is a screening procedure. A screening procedure
is used to quickly assess a large number of items in order to single out a
subset with given properties. For instance, you might screen a population
for breast cancer.

Screening procedures are designed to process large numbers, i.e. they do
not make use of all available diagnostic tools; they replace detailed
knowledge with plausibility, because applying detailed knowledge and
detailed testing usually requires time and resources which are not
available. A detailed cancer test might require you to be hospitalized for
1-2 days; say that (optimistically) costs $200. Doing that for 100 million
people once per year comes to $20 billion per year - so maybe you'd rather
test less accurately for $5 per person. Screenings therefore often test
proxies rather than the real property you're interested in.
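
To make the cost trade-off concrete, here is the arithmetic as a small
Python sketch (the $200, $5 and 100 million figures are the illustrative
numbers from above, not real data):

    # Cost comparison: detailed testing vs. cheap screening (figures from above)
    population = 100_000_000     # people tested once per year
    cost_detailed = 200          # USD per detailed (hospital) test
    cost_screening = 5           # USD per screening test

    print(f"detailed:  {population * cost_detailed / 1e9:.1f} billion USD/year")
    print(f"screening: {population * cost_screening / 1e9:.1f} billion USD/year")
    # detailed:  20.0 billion USD/year
    # screening: 0.5 billion USD/year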

For any given instance, it is always true that a detailed test gives more
accurate results. It is also true that a screening produces both false
positives (i.e. it assigns a property to something which in fact does not
have that property) and false negatives (i.e. it fails to assign the
property to something which in fact has it).

It is not required (nor reasonable to require) that a screening procedure
is always correct or that the plausibility assumptions underlying it are
always fulfilled. What is required is that the screening procedure is
right most of the time. Depending on the problem, you want to minimize the
rate of false positives, of false negatives, or both - in the cancer
example, it is better to send a few more people to detailed testing than
to miss too many real cancer cases, so you try to minimize the false
negatives.
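
If it helps, here is a toy simulation of such a screening; the
sensitivity, specificity and prevalence values are made-up numbers,
purely to illustrate what the two error rates mean:

    # Toy screening simulation - all rates are made-up, for illustration only
    import random
    random.seed(42)

    prevalence  = 0.01    # fraction of the population with the condition
    sensitivity = 0.90    # P(test positive | condition present)
    specificity = 0.90    # P(test negative | condition absent)

    fn = fp = sick_total = healthy_total = 0
    for _ in range(1_000_000):
        sick = random.random() < prevalence
        positive = random.random() < (sensitivity if sick else 1 - specificity)
        if sick:
            sick_total += 1
            fn += not positive     # a real case the screening missed
        else:
            healthy_total += 1
            fp += positive         # a healthy person sent to detailed testing

    print(f"false-negative rate: {fn / sick_total:.3f}")     # ~0.10
    print(f"false-positive rate: {fp / healthy_total:.3f}")  # ~0.10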

So, what you have shown with the KC-135 is a case in which a default
assumption was wrong, but in which the scheme still (for whatever reason)
gave a good answer. That's not very problematic - one wouldn't consider it
a problem if a screening test picks up a cancer for the wrong reason when
there in fact is a cancer. So far you have shown me one example in which
the default assumption does not work. With roughly 400 rated aircraft, if
there are no more such cases, the assumption has an accuracy of 99.75%. If
you can find as many as 40 planes with a similar history, in which the
designer did not care about cockpit layout, the default assumption would
still have an accuracy of 90%. That's pretty good to me - and the chance
that the default assumption fails but the result is still reasonable is
even better than that!
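
For reference, the arithmetic behind those percentages (the sample size
of about 400 aircraft is inferred from the quoted figures, not an exact
count):

    # Accuracy of the default assumption for k known failures out of N
    # rated aircraft (N ~ 400 is inferred from the percentages, not exact)
    N = 400
    for k in (1, 40):
        print(f"{k:2d} failure(s) out of {N}: accuracy = {1 - k / N:.2%}")
    #  1 failure(s) out of 400: accuracy = 99.75%
    # 40 failure(s) out of 400: accuracy = 90.00%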

The Concorde is in some sense way more problematic, because it is actually
a 'wrong' result - a false negative (i.e. a high-quality plane gets a low
rating). But here precisely the same question arises - what is the rate of
false negatives? What is the actual probability that this happens to a
second plane in the sample?

Of course I don't factually know that (because I have no detailed test
data for all aircraft), but I can give an estimate based on the sub-sample
of planes I know better - this is where statistics comes in (I could even
compute error margins for that estimate, although I have not done so yet).
And that estimate suggests that the rate of false positives and negatives
is low - about 2.5% for a deviation of 5 points between quality and
visuals, which means the rating works better than that 97.5% of the time.
Again, this is a number which I consider entirely reasonable.
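
Such error margins could come from, e.g., a Wilson score interval - a
minimal sketch, where the sub-sample size n and the mismatch count k are
hypothetical placeholders (not my actual counts):

    # 95% Wilson score interval for an estimated mismatch rate
    # (n and k below are hypothetical placeholders, not actual counts)
    from math import sqrt

    def wilson(k, n, z=1.96):
        p = k / n
        denom  = 1 + z**2 / n
        center = (p + z**2 / (2 * n)) / denom
        half   = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
        return center - half, center + half

    low, high = wilson(k=2, n=80)    # e.g. 2 mismatches in 80 well-known planes
    print(f"point estimate {2/80:.1%}, 95% interval {low:.1%} .. {high:.1%}")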

It doesn't matter whether the rating works perfectly in every instance, or
whether the assumptions capture every instance correctly. On average the
results are reasonable, and they give you an overview.

Having an overview picture of something with a 10% error margin is better
than having no overview at all with a 1% error margin: screening 90% of a
population for cancer with a 10% rate of false positives and negatives is
far more effective than testing 1% of the population in detail with a 1%
failure rate.
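
In expected terms (assuming, say, a made-up prevalence of 1% just to have
a number to compute with):

    # Expected real cases caught per million people under the two schemes
    # (the 1% prevalence is a made-up number, purely for illustration)
    population, prevalence = 1_000_000, 0.01
    cases = population * prevalence                # 10,000 real cases

    caught_screening = 0.90 * cases * (1 - 0.10)   # 90% coverage, 10% false negatives
    caught_detailed  = 0.01 * cases * (1 - 0.01)   # 1% coverage, 1% failure rate

    print(f"screening: {caught_screening:.0f} of {cases:.0f} cases caught")
    print(f"detailed:  {caught_detailed:.0f} of {cases:.0f} cases caught")
    # screening: 8100 of 10000 cases caught
    # detailed:  99 of 10000 cases caught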

*shrugs*

Codify any testing scheme you like, and I bet I can construct a case which
is somehow not adequately treated by it. It doesn't matter that I can do
that - what matters is the rate at which such cases actually occur, and
the amount of resources it takes to run the scheme.

Hope that helps a bit,

* Thorsten

