For the Japanese/English/French/German/Dutch/Russian/Spanish/Portuguese dictionary (with lots of searchable metadata) that I am developing for Android, I'm using a multi-field index that takes human input (a single string), and I have to:

USE 1 : guess/associate each term/range to one (or more) relevant fields (field disambiguation)
USE 2 : suggest relevant terms for a given field

I managed to make it work in a satisfying way with WFSTCompletionLookup <http://lucene.apache.org/core/3_6_0/api/contrib-spellchecker/org/apache/lucene/search/suggest/fst/WFSTCompletionLookup.html> structures, and in a not so satisfying way for terms with wildcards/regex and for ranges, mainly because the Lookup interface is much too limited for my use cases. So I'm looking for something better.


for USE 1 :

I need to QUICKLY know

option A : whether there are any documents (a boolean)
option B : the number of documents (an int)

that

1) are in a range like "{an TO bam]"
2) have a specified term (like "an*", "an~1" or "/[ms]ad/")


for USE 2:

I need to QUICKLY get

option C : the most frequent completion (with number of docs) for a given term like "an?b*" (WFSTCompletionLookup <http://lucene.apache.org/core/3_6_0/api/contrib-spellchecker/org/apache/lucene/search/suggest/fst/WFSTCompletionLookup.html> only does an*)
option D : the set of terms (with number of docs) that satisfy a regex or a given term like "an?b*"
option E : the set of terms (with number of docs) that satisfy a range "{an TO bam]"


Basically, options D and E give me all I need, and both are possible now with queries. But I need it to be suggestion-quick (on a mobile phone, a 1 s wait is too much), not query-quick.

I think that the perfect structure for this would be :

for each field, a simple (in-RAM) ordered list/navigable tree of term + number of docs, that would work well with automatons (FST?),
with the right interface, say:

getTerms(term : String) : ArrayList<Pair<String, Int>>
getTermsForRange(termA:String, termB:String, aIncluded : Boolean, bIncluded : Boolean): ArrayList<Pair<String, Int>>
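
A minimal sketch of such a structure, assuming plain java.util collections rather than any Lucene class. The class name FieldTermDictionary and the methods add, hasTermsInRange and mostFrequent are hypothetical, introduced only for illustration. It handles only "?" and "*" wildcards (not fuzzy or regex terms), and narrows the scan using the literal prefix before the first wildcard:

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Hypothetical per-field dictionary: sorted term -> number of docs. Not a Lucene class.
class FieldTermDictionary {
    private final NavigableMap<String, Integer> terms = new TreeMap<>();

    // Adds docCount to whatever count is already stored for the term.
    void add(String term, int docCount) {
        terms.merge(term, docCount, Integer::sum);
    }

    // Option E: terms (with counts) in a range. termA must not sort after termB.
    List<Map.Entry<String, Integer>> getTermsForRange(String termA, String termB,
                                                      boolean aIncluded, boolean bIncluded) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e
                : terms.subMap(termA, aIncluded, termB, bIncluded).entrySet()) {
            out.add(new SimpleImmutableEntry<String, Integer>(e));
        }
        return out;
    }

    // Option A: is there at least one term in the range?
    boolean hasTermsInRange(String termA, String termB, boolean aIncluded, boolean bIncluded) {
        return !terms.subMap(termA, aIncluded, termB, bIncluded).isEmpty();
    }

    // Option D: terms matching a wildcard ('?' = one char, '*' = any run of chars).
    List<Map.Entry<String, Integer>> getTerms(String wildcard) {
        int cut = 0;
        while (cut < wildcard.length() && "*?".indexOf(wildcard.charAt(cut)) < 0) {
            cut++;
        }
        String prefix = wildcard.substring(0, cut); // literal part before the first wildcard
        Pattern pattern = toPattern(wildcard);
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : terms.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break; // sorted order: we have left the block of terms sharing the prefix
            }
            if (pattern.matcher(e.getKey()).matches()) {
                out.add(new SimpleImmutableEntry<String, Integer>(e));
            }
        }
        return out;
    }

    // Option C: the matching term with the highest count, or null if nothing matches.
    Map.Entry<String, Integer> mostFrequent(String wildcard) {
        Map.Entry<String, Integer> best = null;
        for (Map.Entry<String, Integer> e : getTerms(wildcard)) {
            if (best == null || e.getValue() > best.getValue()) {
                best = e;
            }
        }
        return best;
    }

    // Translates the wildcard into a java.util.regex pattern, quoting literal chars.
    private static Pattern toPattern(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') sb.append(".*");
            else if (c == '?') sb.append('.');
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }
}
```

Option B (the number of documents) is not covered exactly by this sketch: summing the per-term counts overcounts any document that contains several matching terms, so the sum is only an upper bound.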


Does something like this exist in Lucene? (An in-memory term dictionary?)
If not, I will have to code one. What would be the nearest class I could use to base this structure on?

This structure could be built once from the index, with a filter to remove docs that are not needed (for example, those that don't have English translations for German users...), and saved to disk/restored from disk (to avoid heavy processing on an Android phone as much as possible).
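
The save/restore step could look roughly like the sketch below, assuming the per-field structure is a sorted map of term to document count. The class TermCountStore and its flat binary layout (entry count, then term and count pairs) are assumptions made for illustration, not an existing Lucene format:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical helper: writes and reads a sorted term -> doc count map.
class TermCountStore {

    // Layout: int entry count, then for each entry a UTF string and an int.
    static void save(NavigableMap<String, Integer> terms, OutputStream out) throws IOException {
        DataOutputStream data = new DataOutputStream(out);
        data.writeInt(terms.size());
        for (Map.Entry<String, Integer> e : terms.entrySet()) {
            data.writeUTF(e.getKey()); // writeUTF is limited to 65535 encoded bytes per term
            data.writeInt(e.getValue());
        }
        data.flush();
    }

    // Reads back what save() wrote; the caller owns and closes the stream.
    static NavigableMap<String, Integer> load(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int size = data.readInt();
        NavigableMap<String, Integer> terms = new TreeMap<>();
        for (int i = 0; i < size; i++) {
            String term = data.readUTF();
            int docCount = data.readInt();
            terms.put(term, docCount);
        }
        return terms;
    }
}
```

With this layout the expensive part (walking the index and filtering documents) would happen once, off the phone, and the device would only have to read a flat file back into memory.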

Best regards,
Olivier
