On Wed, Nov 19, 2014 at 9:52 AM, Robert Haas <robertmh...@gmail.com> wrote:
> If you agree, then I'm not being clear enough.  I don't think
> that tinkering with the Levenshtein cost factors is a good idea, and I
> think it's unhelpful to suggest something when the suggestion and the
> original word differ by more than 50% of the number of characters in
> the shorter word.

I agree - except for very short strings, where there is insufficient
information to go on.

I was talking about the difficulty of bolting something on top of
Levenshtein distance that looks at the first character. That would
not need two cost schemes, but it would require encoding-aware
matching of the first code point.
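
A minimal sketch of what that matching would look like, assuming UTF-8
for simplicity (a real implementation would consult the server
encoding's own character-length routine; the helper names here are
just for illustration):

"""
#include <stdbool.h>
#include <string.h>

/*
 * Length in bytes of the UTF-8 sequence starting at *s.  Stands in
 * for an encoding-aware routine like PostgreSQL's pg_mblen().
 */
static int
utf8_seq_len(const char *s)
{
	unsigned char c = (unsigned char) *s;

	if (c < 0x80)
		return 1;
	if ((c & 0xE0) == 0xC0)
		return 2;
	if ((c & 0xF0) == 0xE0)
		return 3;
	return 4;
}

/*
 * Do two non-empty strings begin with the same code point?  Once we
 * know how many bytes the first code point spans, comparing raw
 * bytes is enough.
 */
static bool
first_codepoint_matches(const char *a, const char *b)
{
	int		len = utf8_seq_len(a);

	return utf8_seq_len(b) == len && memcmp(a, b, len) == 0;
}
"""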

> Suggesting "col" for "oid" or "x" for "xmax", as crops
> up in the regression tests with this patch applied, shows the folly of
> this: the user didn't mean the other named column; rather, the user
> was confused about whether a particular system column existed for that
> table.

Those are all very terse strings. What you're overlooking is what
breaks when straight Levenshtein distance is used: suggestions in the
regression tests that are reasonable and helpful go away. As I
mentioned before, rejecting matches whose distance exceeds 50% of the
total string size breaks these, just within the regression tests:

"""
  ERROR:  column "f1" does not exist
  LINE 1: select f1 from c1;
                 ^
- HINT:  Perhaps you meant to reference the column "c1"."f2".
"""

And:

"""
  ERROR:  column atts.relid does not exist
  LINE 1: select atts.relid::regclass, s.* from pg_stats s join
                 ^
- HINT:  Perhaps you meant to reference the column "atts"."indexrelid".
"""

Those are really useful suggestions! And they're much more
representative of real user error.

The downside of weighting deletion lower than substitution and
insertion is much smaller than the upside. It's worth it. The downside
is only that the user sometimes sees a best suggestion that isn't all
that good in an absolute sense (something we have a much harder time
detecting with simple tests when the misspelling is short).
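
To make that concrete, here is a self-contained sketch of the classic
dynamic programming algorithm with separate per-operation costs. The
cost values used below are made up for illustration; they are not what
the patch actually uses:

"""
#include <string.h>

/*
 * Levenshtein distance from s (a candidate column name) to t (what
 * the user typed), with separate insertion, deletion and substitution
 * costs.  Charging deletions of candidate characters less means a
 * long name like "indexrelid" stays close to a short fragment like
 * "relid".
 */
static int
weighted_levenshtein(const char *s, const char *t,
					 int ins_c, int del_c, int sub_c)
{
	int		m = strlen(s);
	int		n = strlen(t);
	int		row[n + 1];		/* C99 VLA; fine for a sketch */
	int		i,
			j;

	for (j = 0; j <= n; j++)
		row[j] = j * ins_c;

	for (i = 1; i <= m; i++)
	{
		int		diag = row[0];	/* value of row[i-1][j-1] */

		row[0] = i * del_c;
		for (j = 1; j <= n; j++)
		{
			int		sub = diag + (s[i - 1] == t[j - 1] ? 0 : sub_c);
			int		ins = row[j - 1] + ins_c;
			int		del = row[j] + del_c;
			int		best = sub < ins ? sub : ins;

			if (del < best)
				best = del;
			diag = row[j];
			row[j] = best;
		}
	}
	return row[n];
}
"""

With illustrative costs of ins_c = sub_c = 2 and del_c = 1,
"indexrelid" vs. "relid" costs 5 (five deletions), whereas a
same-length candidate differing from "relid" in just three characters
would already cost 6.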

> If we had a large database of examples showing what the user typed and
> what they intended, we could try different algorithms against it and
> see which one performs best with fewest false positives.  But if we
> don't have that, we should do things that are like the things that
> other people have done before.

That seems totally impractical. No one has that kind of data, as far
as I'm aware.

How about git as a kind of precedent? It is not at all conservative
about showing *some* suggestion:

"""
$ git aa
git: 'aa' is not a git command. See 'git --help'.

Did you mean this?
am

$ git d
git: 'd' is not a git command. See 'git --help'.

Did you mean one of these?
diff
add

$ git ddd
git: 'ddd' is not a git command. See 'git --help'.

Did you mean this?
add
"""

And why wouldn't git be? As far as it's concerned, you could only have
meant one of a small number of things. Similarly, with the patch, the
number of things we can pick from is fairly limited at this stage,
since we are already fairly far along with parse analysis.

Now, this won't give a suggestion:

"""
$ git aaaa
git: 'aaaa' is not a git command. See 'git --help'.
"""

So it looks like git similarly weights deletion lower than
insertion/substitution. Are its suggestions any less ridiculous than
your examples of questionable hints from the modified regression test
expected output? This is just a guidance mechanism, and at worst we'll
show the best match (on the mechanism's own terms, which isn't too
bad).
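
Just to illustrate the arithmetic (I haven't checked git's actual cost
values; this reuses the sketch above, with the same made-up costs):

"""
#include <stdio.h>

/* Assumes weighted_levenshtein() from the sketch above. */
int
main(void)
{
	/* candidate "add" vs. what was typed; ins = 2, del = 1, sub = 2 */
	printf("ddd:  %d\n", weighted_levenshtein("add", "ddd", 2, 1, 2));	/* 2 */
	printf("aaaa: %d\n", weighted_levenshtein("add", "aaaa", 2, 1, 2));	/* 6 */
	return 0;
}
"""

A distance of 2 for "ddd" (one substitution) plausibly clears whatever
cutoff git applies, while 6 for "aaaa" (two substitutions plus an
insertion) does not, which matches the behavior above.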

-- 
Peter Geoghegan

