[ 
https://issues.apache.org/jira/browse/LUCENE-3298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162156#comment-13162156
 ] 

Carlos González-Cadenas commented on LUCENE-3298:
-------------------------------------------------

Hello Dawid:

First of all, for the FSTLookup we're using a FST and not a FSA, that means 
that we've replaced the NoOutputs with ByteSequenceOutputs. The reason for this 
is that our autocomplete system requires to get some metadata about the 
sentence being matched, because we do some post-match reranking of the results 
in order to get quality suggestions (the buckets themselves are quite big and 
the results are quite poor without reranking, specially for short prefixes)

The queries we use with the FSTLookup are generated using templates. Our domain 
is hotels, simplifying things that means that we have K templates (i.e. an 
example template could be hotels + amenity + city), then we have I individuals 
(i.e. all the cities in the world ), and we generate queries like:

hotels with jacuzzi in barcelona

The output for this key contains some metadata that is useful for reranking 
matches or displaying the matches to the user. We encode it in as few bytes as 
possible, and we make sure that things that are common to many sentences are 
put first so that they can get compacted appropriately. The metadata in the 
output is the following (in order):

1) Display version of the sentence. The match sentence described above is 
normalized and transliterated to ASCII, i.e. it converts acutes to normal 
characters, and stuff like this. We have the "display" version to show the 
correctly formatted version to the user. We apply simple transliteration rules 
so that the differnece between the display and the match is not very big (i.e. 
, we don't apply stuff like pinyin transliteration for chinese).
2) The ID of the city (an int that will help us to later execute the query and 
give the right results)
3) The "score" of the sentence (float, used to rerank the matches)
4) A byte to represent the number of alternate names used in the query (i.e. 
"barcelona" is also referred as "la ciudad condal", if the prefix is "hotels 
in" we don't want to produce both matches, just the most common one. This value 
is used for reranking.
5) The number of skipped tokens (we use this for infix matches). This is 
another byte. This is used for reranking, to give priority to prefix queries 
(skipped tokens = 0) over infix queries, and if all the matches are infix, to 
give the ones where the matched word is closer to the beginning of the 
sentence).

The sentences are quite repetitive, but we don't seem to achieve an optimal 
compression. A GZ-compressed file of 250MB is generating a FST of around 1.9GB. 

We've applied the patch to trunk, but I cannot get it to compile. Maybe the 
version that is uploaded to JIRA is out of date. Do you have a newer version?

Thanks a lot,
Carlos



                
> FST has hard limit max size of 2.1 GB
> -------------------------------------
>
>                 Key: LUCENE-3298
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3298
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: core/FSTs
>            Reporter: Michael McCandless
>            Priority: Minor
>         Attachments: LUCENE-3298.patch
>
>
> The FST uses a single contiguous byte[] under the hood, which in java is 
> indexed by int so we cannot grow this over Integer.MAX_VALUE.  It also 
> internally encodes references to this array as vInt.
> We could switch this to a paged byte[] and make the far larger.
> But I think this is low priority... I'm not going to work on it any time soon.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to