[jira] [Commented] (SOLR-2761) FSTLookup should use long-tail like discretization instead of proportional (linear)

Dawid Weiss (JIRA) Wed, 14 Sep 2011 13:12:33 -0700

    [ 
https://issues.apache.org/jira/browse/SOLR-2761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13104845#comment-13104845
 ]


Dawid Weiss commented on SOLR-2761:
-----------------------------------

I guess a lot depends on the use case. In my case quantization was not a 
problem (the scores were "rough" and query-independent anyway, so they fell 
naturally into buckets). Whether performance is "poor" has to be judged against 
what the requirement really is -- if one needs sorting by exact scores, then the 
method used to speed up FSTLookup simply isn't a good fit. Still, compared to 
fetching everything and re-sorting, this is a hell of a lot faster, so many 
folks (including me) may find it helpful.

It all depends, in other words.

As for using more buckets -- sure, you can do this. In fact, you can combine 
both approaches: use quantization to prefetch a buffer of matches from the 
highest-scoring bucket(s), collect the outputs and sort them by exact score. If 
this fills your desired number of results, there is no need to search any 
further, because everything in the remaining buckets has a lower (exact) score.
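
A minimal sketch of that combined approach, for illustration only. It is not 
the FSTLookup code: plain in-memory lists stand in for the automaton, and the 
names BucketedTopN, Entry and topN are hypothetical. It assumes buckets are 
ordered from highest to lowest weight and that quantization is monotonic, so 
nothing in a later bucket can outscore anything in an earlier one.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of Solr or Lucene.
final class BucketedTopN {
    /** A suggestion together with its exact (unquantized) weight. */
    static final class Entry {
        final String term;
        final float weight;
        Entry(String term, float weight) { this.term = term; this.weight = weight; }
    }

    /**
     * Returns up to n entries matching the prefix, ordered by exact weight.
     * Assumes buckets.get(0) is the highest-weight bucket and that every entry
     * in a bucket outweighs (or ties) every entry in any later bucket.
     */
    static List<Entry> topN(List<List<Entry>> buckets, String prefix, int n) {
        List<Entry> collected = new ArrayList<>();
        for (List<Entry> bucket : buckets) {
            // Order inside a bucket is arbitrary, so the whole bucket must be
            // scanned before deciding whether to stop.
            for (Entry e : bucket) {
                if (e.term.startsWith(prefix)) {
                    collected.add(e);
                }
            }
            // Enough matches: the remaining buckets only hold lower scores.
            if (collected.size() >= n) {
                break;
            }
        }
        // Re-sort the small prefetched buffer by exact weight, descending.
        collected.sort((a, b) -> Float.compare(b.weight, a.weight));
        return new ArrayList<>(collected.subList(0, Math.min(n, collected.size())));
    }
}
```

The early exit only happens at a bucket boundary, which is what keeps the 
result exact: the buffer may hold more than n matches, but the final sort 
trims it.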

> FSTLookup should use long-tail like discretization instead of proportional 
> (linear)
> -----------------------------------------------------------------------------------
>
>                 Key: SOLR-2761
>                 URL: https://issues.apache.org/jira/browse/SOLR-2761
>             Project: Solr
>          Issue Type: Improvement
>          Components: spellchecker
>    Affects Versions: 3.4
>            Reporter: David Smiley
>            Priority: Minor
>
> The Suggester's FSTLookup implementation discretizes the term frequencies 
> into a configurable number of buckets (configurable as "weightBuckets") in 
> order to deal with FST limitations. The mapping of a source frequency into a 
> bucket is a proportional (i.e. linear) mapping from the minimum and maximum 
> value. I don't think this makes sense at all given the well-known long-tail 
> like distribution of term frequencies. As a result of this problem, I've 
> found it necessary to increase weightBuckets substantially, like >100, to get 
> quality suggestions. 
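
To make the difference concrete, here is a rough sketch contrasting a 
proportional mapping with one possible long-tail-friendly alternative. This is 
not the FSTLookup source: the class and method names are hypothetical, and the 
logarithmic scale is only an assumed example of a "long-tail like" 
discretization, since the issue does not prescribe a specific formula.

```java
// Hypothetical illustration; not part of Solr or Lucene.
final class WeightBuckets {
    /** Proportional (linear) mapping of a weight in [min, max] to [0, buckets - 1]. */
    static int linearBucket(long weight, long min, long max, int buckets) {
        if (max <= min) {
            return 0; // degenerate range: everything lands in one bucket
        }
        double fraction = (double) (weight - min) / (max - min);
        return (int) Math.min(buckets - 1, (long) (fraction * buckets));
    }

    /** Logarithmic mapping: spends more buckets on the low-frequency tail. */
    static int logBucket(long weight, long min, long max, int buckets) {
        if (max <= min) {
            return 0;
        }
        // log1p(x) = ln(1 + x), so weight == min maps to fraction 0.
        double fraction = Math.log1p(weight - min) / Math.log1p(max - min);
        return (int) Math.min(buckets - 1, (long) (fraction * buckets));
    }
}
```

With frequencies between 1 and 1,000,000 and 10 buckets, the linear mapping 
puts both 10 and 2,000 into bucket 0, while the logarithmic one separates them 
(buckets 1 and 5), which is the kind of resolution the description asks for 
without raising weightBuckets.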


        
