Thanks for the replies.

On Sat, Jun 4, 2016 at 8:48 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Sat, Jun 4, 2016 at 2:50 AM, Greg Navis <cont...@gregnavis.com> wrote:
> > Thanks for your replies.
> >
> > Sorry for the confusion. Instead of `similarity(lhs, rhs) >= show_limit()`,
> > which of course is completely equivalent to `lhs % rhs`, I wanted to
> > write `similarity(lhs, rhs) >= my_custom_threshold`. It seems that the
> > approach with ternary operators is quite a bit of work. I might have a
> > simpler idea:
>
> The problem is that you would have to build your index on the
> expression (similarity(lhs, rhs)), and then one of lhs or rhs would
> have to be a column, and the other would have to be a constant string
> at the time you define the index, which is the same constant string
> as specified when you want to use the index. This could work as long
> as there are only a few strings you ever care about checking
> similarity to, and you would build an index for each one. These would
> be ordinary btree indexes, not gin indexes. I've done this before,
> only using substring rather than similarity.
>
> > pg_trgm also provides `<->` but it seems this operator doesn't use
> > indexes either. It seems the shortest path to per-query thresholds,
> > without compromising the design, is making this operator use the
> > index.
>
> To be efficient, the index has to know where to start and stop its
> scan. Such a thing doesn't exist for "a <-> b"; there is no stopping
> point. It would have to be "a <-> b < c", at which point you are back
> to ternary again.

Sorry, that was sloppy thinking on my end. I was fixated on looking for
something simple and saw things that aren't there. ;-) You're absolutely
right that we're back to ternary with `a <-> b < c`.

> > Please help me understand whether my reasoning is correct. If it is,
> > I'd appreciate a high-level overview of what needs to be done. I can
> > block a few hours to work on this in the upcoming weeks.
>
> I think that you are proposing rewriting the indexing architecture
> from scratch.
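Coming back to the expression-index idea for a second, just to check I
follow. The sketch below is what I understood you to mean; the table,
column, and the string 'acme' are made up for illustration:

```sql
-- A plain btree index on the similarity expression against one fixed
-- string (table/column/'acme' are invented examples):
CREATE INDEX users_name_sim_acme_idx
    ON users (similarity(name, 'acme'));

-- The index can only help when the query repeats exactly the same
-- expression, with the same constant string; then the threshold
-- becomes an ordinary btree range condition:
SELECT name
FROM users
WHERE similarity(name, 'acme') >= 0.4;
```

If that's right, the limitation is clear: one index per constant string
you ever want to compare against.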
> Doing that in a way that is robust and tested and backward compatible
> would probably take years, not hours, for a single individual who is
> not already extensively experienced in the internals of the code.
> And, it is not really clear even what you want to rewrite it into.

No rewrites, please. ;-) I was just confused. I'm looking for a simple,
incremental contribution, not a rewrite.

> Tom's proposal is to do it in a way that works with the existing
> architecture, rather than trying to rewrite that architecture.
> (Which could probably not be done in a few hours, either, especially
> not once you write documentation and upgrade scripts and all the
> other stuff that goes into production-quality code.)

Thanks for the info. I block a few hours for open-source work every
week, so whether it takes 10 or 40 hours shouldn't be a problem.

> I don't know if this would even be appropriate as an addition to
> pg_trgm. We might want to fork that code instead. That would be a
> shame, because the underlying C code would be fundamentally the same,
> but the alternative would be to force people who like % and
> set_limit() to carry around the baggage of new operators and types
> they have no interest in using, and vice versa. True, we did just add
> several new functions and operators to pg_trgm that many people will
> have no interest in, so maybe that is not a big deal.
>
> Cheers,
>
> Jeff

Would this be a better plan then:

1. Add support for ternary operators.
2. Implement `issimilar(lhs, rhs, threshold)`.
3. Add `issimilar` to the trigram operator classes.

PS Should we move this discussion to pgsql-hackers?

-- 
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.
Free, biweekly scalability newsletter for SaaS CEOs
<http://www.gregnavis.com/newsletter/>
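PPS To make step 2 of the plan concrete: `issimilar` does not exist in
pg_trgm today, so the name and signature below are just the proposal,
with an invented table for illustration. The point is the per-query
threshold, instead of the session-wide set_limit():

```sql
-- Hypothetical: proposed API, not something pg_trgm provides yet.
-- Each query picks its own threshold; no set_limit() needed.
SELECT name
FROM users
WHERE issimilar(name, 'greg', 0.6);
```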