[
https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162752#comment-13162752
]
Carlos González-Cadenas commented on LUCENE-3298:
-------------------------------------------------
Yeap, at the beginning of this project we tried to implement this autocomplete
system using regular inverted indexes, but the response time required for
autocomplete to work from a user perspective is very low (<50ms), and it would
be quite hard to achieve such a performance with inverted indexes.
I still think this is the way to go, but as you say we have to be careful with
the data generation part. Most of the work should be put in making sure that
the data is well distributed and organized in order to avoid combinatorial
explosion.
Let me go in detail with the sources of data permutations and the reasoning
behind them:
1) With regards to infix matches, if a user types "barcelona" we want to match
"hotels in barcelona". In order to achieve this, we generate:
hotels in barcelona => hotels in barcelona
in barcelona => hotels in barcelona
barcelona => hotels in barcelona
The FST should be able to conflate these prefixes nicely in just one path,
right?. Therefore this part shouldn't be a problem.
2) In addition, another feature we want to achieve is to be able to match
inputs without prepositions. That means that if the user types "hotels
barcelona jacuzzi", we should be able to match "hoteles in barcelona with
jacuzzi". Now the only way we envision of doing it properly is to generate this
permutation within the data:
hotels barcelona jacuzzi => hotels in barcelona with jacuzzi
I can see how this can explode the FST by creating different branches.
Theoretically this could be done at runtime without the need of generating the
data, but we don't see a way to do it in a clean way. To make things more
complicated :) we've implemented fuzzy matching at query time (we use a
levenshtein automata generated with the user input + an edit distance and then
we intersect with the FST), and this is making very complicated to do
preposition handling at query time.
3) PP permutations (i.e. hoteles in barcelona with jacuzzi and hoteles with
jacuzzi in barcelona). I don't really see a way to work around this. Probably
we need to be careful and only generate these permutations for the top-K
cities, in order to limit the potential size.
Summarizing, I believe that we can reduce the set of "bad permutations" a lot
if we can figure out how to implement the prepositions at runtime. If you have
any ideas, let me know. Thanks! :)
> FST has hard limit max size of 2.1 GB
> -------------------------------------
>
> Key: LUCENE-3298
> URL: https://issues.apache.org/jira/browse/LUCENE-3298
> Project: Lucene - Java
> Issue Type: Improvement
> Components: core/FSTs
> Reporter: Michael McCandless
> Priority: Minor
> Attachments: LUCENE-3298.patch
>
>
> The FST uses a single contiguous byte[] under the hood, which in java is
> indexed by int so we cannot grow this over Integer.MAX_VALUE. It also
> internally encodes references to this array as vInt.
> We could switch this to a paged byte[] and make the far larger.
> But I think this is low priority... I'm not going to work on it any time soon.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]