[jira] [Updated] (SOLR-2888) FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code cleanups

Dawid Weiss (Updated) (JIRA) Fri, 18 Nov 2011 03:29:21 -0800

     [ https://issues.apache.org/jira/browse/SOLR-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Dawid Weiss updated SOLR-2888:
------------------------------

    Description: 
This issue addresses several problems:
- Terms were previously stored and looked up as utf16; they are now stored and 
looked up as utf8.
- Construction would OOM with a large number of terms because all entries have 
to be sorted. Sorter APIs have been added, along with an implementation of 
external (on-disk) sorting (Robert Muir).
- The FSTLookup class has been split and refactored into FSTCompletion and 
FSTCompletionBuilder. FSTCompletionLookup remains a facade that connects all 
the pieces and implements the Lookup interface. For large inputs, use 
FSTCompletionBuilder directly (and pre-bucket your input weights).
- Automatic bucketing in FSTCompletionLookup has been changed from linear 
min/max discretization to dividing the values into ranges after they have all 
been sorted. Empirically, this handles all potential distributions quite well. 
If you need something very specific, use FSTCompletionBuilder directly 
(providing the buckets), construct the automaton, and then load it with 
FSTCompletionLookup.

  was:For some reason it uses utf16 internally. Shouldn't make much of a 
difference, really.
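
The external (on-disk) sorting mentioned in the description can be illustrated 
with a minimal sketch. This is not the code from the patch: the class name 
ExternalSortSketch, the method names, and the line-per-entry text format are all 
assumptions made for illustration. It also compares Java strings, which is 
UTF-16 code unit order and not necessarily the order the patch uses.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of an external merge sort; not the SOLR-2888 Sorter API.
public class ExternalSortSketch {

  /** Sorts the lines of input into output, buffering at most maxLinesInMemory lines per run. */
  static void sort(Path input, Path output, int maxLinesInMemory) throws IOException {
    List<Path> runs = new ArrayList<>();
    // Phase 1: read bounded chunks, sort each in memory, spill it to a temp file.
    try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
      List<String> chunk = new ArrayList<>();
      String line;
      while ((line = in.readLine()) != null) {
        chunk.add(line);
        if (chunk.size() >= maxLinesInMemory) {
          runs.add(writeRun(chunk));
          chunk.clear();
        }
      }
      if (!chunk.isEmpty()) {
        runs.add(writeRun(chunk));
      }
    }
    // Phase 2: k-way merge of the sorted runs, then clean up the temp files.
    try {
      merge(runs, output);
    } finally {
      for (Path run : runs) {
        Files.deleteIfExists(run);
      }
    }
  }

  private static Path writeRun(List<String> chunk) throws IOException {
    Collections.sort(chunk); // String order (UTF-16 code units), chosen for brevity
    Path run = Files.createTempFile("sort-run", ".txt");
    Files.write(run, chunk, StandardCharsets.UTF_8);
    return run;
  }

  /** The current smallest unread line of one run, plus the reader it came from. */
  private static final class Head {
    final String line;
    final BufferedReader reader;

    Head(String line, BufferedReader reader) {
      this.line = line;
      this.reader = reader;
    }
  }

  private static void merge(List<Path> runs, Path output) throws IOException {
    PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
    List<BufferedReader> readers = new ArrayList<>();
    try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
      for (Path run : runs) {
        BufferedReader reader = Files.newBufferedReader(run, StandardCharsets.UTF_8);
        readers.add(reader);
        String first = reader.readLine();
        if (first != null) {
          heap.add(new Head(first, reader));
        }
      }
      // Repeatedly emit the smallest head and refill from the same run.
      while (!heap.isEmpty()) {
        Head head = heap.poll();
        out.write(head.line);
        out.newLine();
        String next = head.reader.readLine();
        if (next != null) {
          heap.add(new Head(next, head.reader));
        }
      }
    } finally {
      for (BufferedReader reader : readers) {
        reader.close();
      }
    }
  }
}
```

Only one chunk is held in memory during the first phase and one line per run 
during the merge, which is why this kind of sort avoids the OOM that a single 
in-memory sort of all entries can cause.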

        Summary: FSTSuggester refactoring: utf8 storage, external sorts (OOM 
prevention), code cleanups  (was: FSTSuggester should use utf8/utf32 order )
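
The bucketing change in the description (linear min/max discretization versus 
dividing sorted values into ranges) can be illustrated with a small sketch. The 
class BucketingSketch and both method names are hypothetical and are not part of 
FSTCompletionLookup. Splitting the sorted values into equal-sized ranges by rank 
is an assumption, since the description does not say how the ranges are chosen.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the bucketing code from the patch.
public class BucketingSketch {

  /** Linear min/max discretization: the bucket depends on where the value lies in [min, max]. */
  static int[] linearBuckets(long[] weights, int buckets) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;
    for (long w : weights) {
      min = Math.min(min, w);
      max = Math.max(max, w);
    }
    double range = Math.max(1, max - min);
    int[] out = new int[weights.length];
    for (int i = 0; i < weights.length; i++) {
      long bucket = (long) ((weights[i] - min) / range * buckets);
      out[i] = (int) Math.min(buckets - 1, bucket); // clamp the maximum value into the last bucket
    }
    return out;
  }

  /** Rank-based bucketing: sort the values, then cut the sorted order into equal-sized ranges. */
  static int[] rankBuckets(long[] weights, int buckets) {
    Integer[] order = new Integer[weights.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, (a, b) -> Long.compare(weights[a], weights[b]));
    int[] out = new int[weights.length];
    for (int rank = 0; rank < order.length; rank++) {
      // Simplification: equal weights may land in different buckets.
      out[order[rank]] = (int) ((long) rank * buckets / order.length);
    }
    return out;
  }

  public static void main(String[] args) {
    long[] weights = {1, 2, 3, 1000};
    System.out.println(Arrays.toString(linearBuckets(weights, 4)));
    System.out.println(Arrays.toString(rankBuckets(weights, 4)));
  }
}
```

With the skewed weights 1, 2, 3, 1000 and four buckets, the linear variant puts 
the three small values into the same bucket, while the rank-based variant 
spreads the values across all four.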
    
> FSTSuggester refactoring: utf8 storage, external sorts (OOM prevention), code 
> cleanups
> --------------------------------------------------------------------------------------
>
>                 Key: SOLR-2888
>                 URL: https://issues.apache.org/jira/browse/SOLR-2888
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>            Reporter: Dawid Weiss
>            Assignee: Dawid Weiss
>             Fix For: 4.0
>
>         Attachments: SOLR-2888.patch, SOLR-2888.patch

