Thanks for the replies.

On Sat, Jun 4, 2016 at 8:48 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
> On Sat, Jun 4, 2016 at 2:50 AM, Greg Navis <cont...@gregnavis.com> wrote:
> > Thanks for your replies.
> >
> > Sorry for the confusion. Instead of `similarity(lhs, rhs) >= show_limit()`,
> > which of course is completely equivalent to `lhs % rhs`, I wanted to
> > write `similarity(lhs, rhs) >= my_custom_threshold`. It seems that the
> > approach with ternary operators is quite a bit of work. I might have a
> > simpler idea:
>
> The problem is that you would have to build your index on the
> expression (similarity(lhs, rhs)), and then one of lhs or rhs would
> have to be a column, and the other would have to be a constant string
> at the time you define the index, which is the same constant string
> as specified when you want to use the index. This could work as long
> as there are only a few strings you ever care about checking
> similarity to, and you would build an index for each one. These would
> be ordinary btree indexes, not gin indexes. I've done this before,
> only using substring rather than similarity.
>
> > pg_trgm also provides `<->` but it seems this operator doesn't use
> > indexes either. It seems the shortest path to per-query thresholds,
> > without compromising the design, is making this operator use the
> > index.
>
> To be efficient, the index has to know where to start and stop its
> scan. Such a thing doesn't exist for "a <-> b"; there is no stopping
> point. It would have to be "a <-> b < c", at which point you are back
> to ternary again.

Sorry, that was sloppy thinking on my end. I was fixated on looking for
something simple and saw things that aren't there. ;-) You're absolutely
right that we're back to ternary with `a <-> b < c`.

> > Please help me understand whether my reasoning is correct. If it is,
> > I'd appreciate a high-level overview of what needs to be done. I can
> > block a few hours to work on this in the upcoming weeks.
>
> I think that you are proposing rewriting the indexing architecture
> from scratch.
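Coming back to the expression-index idea for a second, just to check I
follow. The sketch below is what I understood you to mean; the table,
column, and the string 'acme' are made up for illustration:

```sql
-- A plain btree index on the similarity expression against one fixed
-- string (table/column/'acme' are invented examples):
CREATE INDEX users_name_sim_acme_idx
    ON users (similarity(name, 'acme'));

-- The index can only help when the query repeats exactly the same
-- expression, with the same constant string; then the threshold
-- becomes an ordinary btree range condition:
SELECT name
FROM users
WHERE similarity(name, 'acme') >= 0.4;
```

If that's right, the limitation is clear: one index per constant string
you ever want to compare against.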
> Doing that in a way that is robust and tested and backward compatible
> would probably take years, not hours, for a single individual who is
> not already extensively experienced in the internals of the code.
> And, it is not really clear even what you want to rewrite it into.

No rewrites, please. ;-) I was just confused. I'm looking for a simple,
incremental contribution, not a rewrite.

> Tom's proposal is to do it in a way that works with the existing
> architecture, rather than trying to rewrite that architecture.
> (Which could probably not be done in a few hours, either, especially
> not once you write documentation and upgrade scripts and all the
> other stuff that goes into production-quality code.)

Thanks for the info. I block a few hours for open-source work every
week, so whether it takes 10 or 40 hours shouldn't be a problem.

> I don't know if this would even be appropriate as an addition to
> pg_trgm. We might want to fork that code instead. That would be a
> shame, because the underlying C code would be fundamentally the same,
> but the alternative would be to force people who like % and
> set_limit() to carry around the baggage of new operators and types
> they have no interest in using, and vice versa. True, we did just add
> several new functions and operators to pg_trgm that many people will
> have no interest in, so maybe that is not a big deal.
>
> Cheers,
>
> Jeff

Would this be a better plan then:

1. Add support for ternary operators.
2. Implement `issimilar(lhs, rhs, threshold)`.
3. Add `issimilar` to the trigram operator classes.

PS Should we move this discussion to pgsql-hackers?

-- 
Greg Navis
I help tech companies to scale Heroku-hosted Rails apps.
Free, biweekly scalability newsletter for SaaS CEOs
<http://www.gregnavis.com/newsletter/>
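PPS To make step 2 of the plan concrete: `issimilar` does not exist in
pg_trgm today, so the name and signature below are just the proposal,
with an invented table for illustration. The point is the per-query
threshold, instead of the session-wide set_limit():

```sql
-- Hypothetical: proposed API, not something pg_trgm provides yet.
-- Each query picks its own threshold; no set_limit() needed.
SELECT name
FROM users
WHERE issimilar(name, 'greg', 0.6);
```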