RE: Partial word match using n-grams

Becker, Thomas Fri, 19 Jul 2013 05:46:42 -0700

In general the data for this field is that simple, but additional characters 
are allowed beyond [a-z_].  Do I need to tokenize on whitespace?  I really 
don't know.  Essentially, the question is whether we expect "quota tom" to 
match quota_tom or not.  I spoke to some colleagues and they thought it should, 
since both "quota" and "tom" are partial matches that would AND together.  
Tokenizing the entire input, whitespace and all, precludes this match.  I'd 
appreciate some input from anyone on what the best user experience would be 
here; I'm trying to operate on the principle of least surprise ;)

With regard to the padding suggestion, I'm still not sure this will work, 
because again at indexing time there is typically no whitespace.  So padding 
"quota_tommy_1234" to "##quota_tommy_1234##" before trigramming is not going 
to produce the "to#" token that I would need in order for "quota to" to match.
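
The objection can be checked with a few lines of plain Python (not Lucene). This is only a sketch: `trigrams` and `pad` are hypothetical helpers, and the "##" marker is the one suggested in the thread. Padding the indexed name adds grams only at its two ends, so none of the grams from a padded "to" appear in the padded name:

```python
def trigrams(text):
    """Return every 3-character window of text, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def pad(term, marker="##"):
    """Wrap a term in padding markers (assumed scheme: marker on both sides)."""
    return marker + term + marker

# Grams produced at indexing time for the padded object name.
indexed = set(trigrams(pad("quota_tommy_1234")))

# Grams produced at search time for the padded two-letter query word.
query = trigrams(pad("to"))

# Query grams that the index does not contain.
missing = [g for g in query if g not in indexed]
```

Here `missing` holds every gram of the padded query, including "to#", while the unpadded infix gram "_to" is present in `indexed`.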

-----Original Message-----
From: Allison, Timothy B. [mailto:talli...@mitre.org] 
Sent: Friday, July 19, 2013 7:58 AM
To: java-user@lucene.apache.org
Subject: RE: Partial word match using n-grams

Got it...almost.  

Yes, you're right. FuzzyQuery is not at all what you want.

Don't know if your data is actually as simple as this example.  Do you need to 
tokenize on whitespace?   Would it make sense to replace spaces in the query 
with underscores and then trigramify the whole query as if it were a single 
term?  
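
A rough way to see what this suggestion would and would not match, sketched in plain Python rather than Lucene. The helper names are hypothetical, and the check only tests that each query gram exists in the indexed name; it ignores gram positions, which a real PhraseQuery would also enforce:

```python
def trigrams(text):
    """Return every 3-character window of text, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def query_grams(query):
    """Join the query words with underscores and trigram the result as one term."""
    return trigrams("_".join(query.split()))

# Trigrams of the example object name, as they would exist at indexing time.
indexed = set(trigrams("quota_tommy_1234"))

def matches(query):
    """True if the joined query yields trigrams and all of them are indexed."""
    grams = query_grams(query)
    return bool(grams) and all(g in indexed for g in grams)
```

Under these assumptions `matches("quota tom")` and `matches("quota to")` both succeed, because "quota_tom" and "quota_to" are substrings of the name. `matches("quo 234")` fails, because "quo_234" produces grams such as "uo_" that the name does not contain.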

________________________________________
From: Becker, Thomas [thomas.bec...@netapp.com]
Sent: Thursday, July 18, 2013 8:59 PM
To: java-user@lucene.apache.org
Subject: RE: Partial word match using n-grams

Thanks for the reply Tim.  I really should have been clearer.  Let's say I have 
an object named "quota_tommy_1234".  I'd like to match that object with any 
substring of that name that is 3 or more characters long.  So for example:

quo
tom
234
quota
etc.

Further, at search time I'm splitting input on whitespace before tokenizing 
into PhraseQueries and then ANDing them together.  So using the example above I 
also want the following queries to match:

quo tom
quo 234
quota to <- this is the problem because there are no trigrams of "to"
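
The search-time behaviour described above can be sketched in plain Python (not Lucene). `word_matches` is a hypothetical stand-in for one PhraseQuery over consecutive trigrams, and the sketch assumes that a word yielding no trigrams matches nothing:

```python
def trigrams(text):
    """Return every 3-character window of text, in order."""
    return [text[i:i + 3] for i in range(len(text) - 2)]

def word_matches(word, name):
    """Stand-in for one PhraseQuery: the word's trigrams must appear
    consecutively among the name's trigrams."""
    grams = trigrams(word)
    name_grams = trigrams(name)
    if not grams:
        return False  # a 1- or 2-letter word yields no trigrams at all
    return any(name_grams[i:i + len(grams)] == grams
               for i in range(len(name_grams) - len(grams) + 1))

def matches(query, name):
    """Split the query on whitespace and AND the per-word results together."""
    return all(word_matches(word, name) for word in query.split())
```

With the name "quota_tommy_1234", `matches("quo tom", ...)` and `matches("quo 234", ...)` succeed, while `matches("quota to", ...)` fails because `trigrams("to")` is empty.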

That said, in response to your points:

1)  Not sure FuzzyQuery is what I need; I'm not trying to match via 
misspellings, which is the main function of FuzzyQuery, is it not?

2) The original names are all going to be > 3 characters, so there are no 1 or 
2 letter terms at indexing time.  So generating the bigram "to" at search time 
will never match anything, unless I switch to bigrams at indexing time also, 
which is what I'm asking about.

3)  Again the names are all > 3 characters so I don't need to pad at indexing 
time.

4) Hopefully my explanation above clarifies.

I should point out that I'm a Lucene novice and am not at all sure that what 
I'm doing is optimal.  But I have been impressed with how easy it is to get 
something working very quickly!

________________________________________
From: Allison, Timothy B. [talli...@mitre.org]
Sent: Thursday, July 18, 2013 7:49 PM
To: java-user@lucene.apache.org
Subject: RE: Partial word match using n-grams

Tommy,
  I'm sure that I don't fully understand your use case and your data.  Some 
thoughts:

1) I assume that fuzzy term search (edit distance <= 2) isn't meeting your 
needs or else you wouldn't have gone the ngram route.  If fuzzy term search + 
phrase/proximity search would meet your needs, see if ComplexPhraseQueryParser 
would work (although it looks like you're already building your own queries).

2) Would it make sense to modify NGramFilter so that it outputs a bigram for a 
two letter term and a unigram for a one letter term?  Might be messy...and "ab" 
in this scenario would never match "abc"

3) Would it make sense to pad your terms behind the scenes with "##"?  This 
would add bloat, but not nearly as much as variable gram sizes with 1 <= n <= 3

ab -> ##ab## yields trigrams ##a, #ab, ab#, b##
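
A small sketch of the padding idea and of the bloat comparison, in plain Python rather than Lucene. The function names are hypothetical, and the counts are simply the number of grams generated per term, not a measurement of index size:

```python
def ngrams(text, n):
    """Return every n-character window of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def padded_trigrams(term, marker="##"):
    """Pad the term on both sides, then take trigrams of the padded string."""
    return ngrams(marker + term + marker, 3)

def variable_grams(term, lo=1, hi=3):
    """All n-grams of the term for every size from lo to hi inclusive."""
    return [gram for n in range(lo, hi + 1) for gram in ngrams(term, n)]

name = "quota_tommy_1234"
padded_count = len(padded_trigrams(name))    # grams per term with padding
variable_count = len(variable_grams(name))   # grams per term with sizes 1 to 3
```

For a term of length L this gives L + 2 padded trigrams versus 3L - 3 grams for sizes 1 through 3 (when L >= 3), which is the difference in bloat the suggestion refers to.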

4) How partial and what types of partial do you need?  This is related to 1).  
If minimum edit distance is sufficient, use it, especially with the blazing 
fast automaton (thank you, Robert Muir). If you have a smallish dataset you 
might consider allowing leading wildcards so that you could easily find all 
words, for example, containing abc with *abc*.  If your dataset is larger, you 
might consider something like ReversedWildcardFilterFactory (Solr) to speed 
this type of matching.

I look forward to other opinions from the list.

-----Original Message-----
From: Becker, Thomas [mailto:thomas.bec...@netapp.com]
Sent: Thursday, July 18, 2013 3:55 PM
To: java-user@lucene.apache.org
Subject: Partial word match using n-grams

One of our main use-cases for search is to find objects based on partial name 
matches.  I've implemented this using n-grams and it works pretty well.  
However, we're currently using trigrams, and that causes an interesting problem 
when searching for things like "abc ab", since we first split on whitespace and 
then construct PhraseQueries containing each trigram yielded by the "word".  
Obviously we cannot get a trigram out of "ab".  So our choices would seem to be 
either to discard this part of the search term, which seems unwise, or to reduce 
the minimum n-gram size.  But I'm slightly concerned about the resulting bloat 
in both the number of Terms stored in the index and the number contained in 
queries.  Is this something I should be concerned about?  It just "feels" like 
a query for the word "abcdef" shouldn't require a PhraseQuery of 15 terms 
(assuming n-gram sizes 1 through 3).  Is this the best way to do partial word matches?  
Thanks in advance.
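
The figure of 15 terms can be reproduced with a short plain-Python count (a sketch, not Lucene code; `ngrams` is a hypothetical helper that emits every gram of each size in the given range):

```python
def ngrams(text, min_n, max_n):
    """All n-grams of text for every size from min_n to max_n inclusive."""
    return [text[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(text) - n + 1)]

sizes_1_to_3 = len(ngrams("abcdef", 1, 3))   # 6 unigrams + 5 bigrams + 4 trigrams
sizes_2_to_3 = len(ngrams("abcdef", 2, 3))   # 5 bigrams + 4 trigrams
trigrams_only = len(ngrams("abcdef", 3, 3))  # 4 trigrams
```

Lowering the minimum size from 3 to 2 roughly doubles the grams for this word, and lowering it to 1 nearly quadruples them.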

-Tommy

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org

