Suggesters

Olivier Binda Sun, 06 Apr 2014 08:19:13 -0700

For the Japanese/English/French/German/Dutch/Russian/Spanish/Portuguese dictionary (with lots of searchable metadata) that I am developing for Android, I'm using a multi-field index that is queried from human input (a single string), and I have to

USE 1 : guess/associate each term/range to one (or more) relevant fields (field disambiguation)

USE 2 : suggest relevant terms for a given field

I managed to make it work in a satisfying way with WFSTCompletionLookup <http://lucene.apache.org/core/3_6_0/api/contrib-spellchecker/org/apache/lucene/search/suggest/fst/WFSTCompletionLookup.html> structures, and in a not so satisfying way for terms with wildcards/regex... and ranges, mainly because the Lookup interface is much too limited for my use cases. So, I'm looking for something better.


for USE 1 :

I need to QUICKLY know

option A : if there are documents (a boolean)
option B : the number of documents (an int)

that

1) are in a range like "{an TO bam]"
2) have a specified term (like "an*", "an~1" or "/[ms]ad/")
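As a minimal sketch of options A and B for the range case, assume the terms of one field are held in a sorted in-memory map from term to document count. The class `RangeProbe` and its method names are hypothetical and not part of Lucene. Note two limits of this approach: summing per-term counts only gives an upper bound on the number of distinct documents (a document containing two terms of the range is counted twice), and Java orders strings by UTF-16 code units, which can differ from the index's term order for characters outside the Basic Multilingual Plane.

```java
import java.util.NavigableMap;

/** Hypothetical helper: answers range questions from a sorted term -> docCount map. */
final class RangeProbe {
    private RangeProbe() {}

    /** Option A: true if at least one term falls inside the range. */
    static boolean hasTerms(NavigableMap<String, Integer> terms,
                            String lo, boolean loIncluded,
                            String hi, boolean hiIncluded) {
        // subMap throws if lo > hi, so treat an inverted range as empty.
        if (lo.compareTo(hi) > 0) {
            return false;
        }
        return !terms.subMap(lo, loIncluded, hi, hiIncluded).isEmpty();
    }

    /** Option B (approximation): sum of per-term doc counts inside the range. */
    static long docCountUpperBound(NavigableMap<String, Integer> terms,
                                   String lo, boolean loIncluded,
                                   String hi, boolean hiIncluded) {
        if (lo.compareTo(hi) > 0) {
            return 0L;
        }
        long sum = 0L;
        for (int docCount : terms.subMap(lo, loIncluded, hi, hiIncluded).values()) {
            sum += docCount;
        }
        return sum;
    }
}
```

With this sketch, the range "{an TO bam]" corresponds to `RangeProbe.hasTerms(terms, "an", false, "bam", true)`.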


for USE 2:

I need to QUICKLY get

option C : the most frequent completion (with number of docs) for a given term like "an?b*" (WFSTCompletionLookup <http://lucene.apache.org/core/3_6_0/api/contrib-spellchecker/org/apache/lucene/search/suggest/fst/WFSTCompletionLookup.html> only does an*)
option D : the set of terms (with number of docs) that satisfy a regex or a given term like "an?b*"
option E : the set of terms (with number of docs) that satisfy a range "{an TO bam]"

Basically, options D and E give me all I would need and are possible now with queries. But I need it to be suggestion-quick (on a mobile phone, a 1 s wait is too much), not query-quick.


I think that the perfect structure for this would be :

for each field, a simple (in RAM) ordered list/navigable tree of term + number of docs, that would work well with automatons (FST ?)

with the right interface say

getTerms(term : String) : ArrayList<Pair<String, Int>>

getTermsForRange(termA : String, termB : String, aIncluded : Boolean, bIncluded : Boolean) : ArrayList<Pair<String, Int>>
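To make the proposed interface concrete, here is a rough sketch built on `java.util.TreeMap` rather than on any Lucene class. The name `TermDictionary`, the `add` and `mostFrequent` methods, and the choice to translate `?` and `*` into a `java.util.regex` pattern are all assumptions made for illustration; fuzzy terms like "an~1" are not covered. The literal prefix before the first wildcard is used to narrow the scan to one contiguous slice of the sorted map.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

/** Hypothetical per-field, in-memory dictionary of term -> number of docs. */
final class TermDictionary {
    private final NavigableMap<String, Integer> terms = new TreeMap<>();

    /** Registers a term; repeated calls for the same term add up. */
    void add(String term, int docCount) {
        terms.merge(term, docCount, Integer::sum);
    }

    /** Option D: terms matching a wildcard pattern ('?' = one char, '*' = any run). */
    List<Map.Entry<String, Integer>> getTerms(String wildcard) {
        String prefix = wildcard.substring(0, firstWildcard(wildcard));
        Pattern pattern = Pattern.compile(toRegex(wildcard), Pattern.DOTALL);
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        // Terms sharing a prefix are contiguous in sorted order: start there, stop early.
        for (Map.Entry<String, Integer> e : terms.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            if (pattern.matcher(e.getKey()).matches()) {
                out.add(new SimpleImmutableEntry<>(e));
            }
        }
        return out;
    }

    /** Option E: terms inside a range, in sorted order. */
    List<Map.Entry<String, Integer>> getTermsForRange(String termA, String termB,
                                                      boolean aIncluded, boolean bIncluded) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        if (termA.compareTo(termB) > 0) {
            return out; // inverted range: nothing to return
        }
        for (Map.Entry<String, Integer> e
                : terms.subMap(termA, aIncluded, termB, bIncluded).entrySet()) {
            out.add(new SimpleImmutableEntry<>(e));
        }
        return out;
    }

    /** Option C: the matching term with the highest doc count, or null if none. */
    Map.Entry<String, Integer> mostFrequent(String wildcard) {
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> e : getTerms(wildcard)) {
            if (best == null || e.getValue() > best.getValue()) {
                best = e;
            }
        }
        return best;
    }

    private static int firstWildcard(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) == '*' || s.charAt(i) == '?') {
                return i;
            }
        }
        return s.length();
    }

    /** Quotes literal runs and maps '?' to "." and '*' to ".*". */
    private static String toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return regex.toString();
    }
}
```

A pattern that starts with a wildcard has an empty prefix, so this sketch falls back to scanning every term of the field; an automaton-based structure would be the natural improvement there.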



Does something like this exist in Lucene? (An in-memory term dictionary?)

If not, I will have to code one. What would be the nearest class I could use to base this structure on?

This structure could be built once from the index, with a filter to remove docs not needed (for example, those that don't have English translations for German users...) and saved to disk/restored from disk (to avoid heavy processing on an Android phone as much as possible)
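The save/restore step could look roughly like the following, assuming the structure is a sorted map from term to document count. The class name `TermDictionaryIO` and the binary layout (an entry count followed by term/count pairs) are made up for this sketch; building the map from the index and filtering documents are left out. `writeUTF` limits each term to 65535 encoded bytes, which should be ample for dictionary terms.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical save/restore helper for a term -> docCount map. */
final class TermDictionaryIO {
    private TermDictionaryIO() {}

    /** Writes the entry count, then each (term, docCount) pair in sorted order. */
    static void save(NavigableMap<String, Integer> terms, OutputStream out) {
        try {
            DataOutputStream data = new DataOutputStream(new BufferedOutputStream(out));
            data.writeInt(terms.size());
            for (Map.Entry<String, Integer> e : terms.entrySet()) {
                data.writeUTF(e.getKey());
                data.writeInt(e.getValue());
            }
            data.flush(); // the caller remains responsible for closing the stream
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }

    /** Reads back a map written by save(). */
    static NavigableMap<String, Integer> load(InputStream in) {
        try {
            DataInputStream data = new DataInputStream(new BufferedInputStream(in));
            int size = data.readInt();
            NavigableMap<String, Integer> terms = new TreeMap<>();
            for (int i = 0; i < size; i++) {
                String term = data.readUTF();
                terms.put(term, data.readInt());
            }
            return terms;
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
    }
}
```

The file could then be generated once on a desktop machine and shipped with the application, so the phone only pays the cost of `load`.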


Best regards,
Olivier
