RE: SweetSpotSimilarity

Paul Allan Hill Fri, 17 Feb 2012 11:42:10 -0800

> -----Original Message-----
> From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
> As for what hyperbolicTf is trying to do ... it creates a hyperbolic function
> letting you specify a hard max no matter how many terms there are.


A picture -- or more precisely a graph -- would be worth a thousand words.  As it
says in issue LUCENE-577, "a hyperbolic tf function which is best explained by
graphing the equation".  That's great, but I couldn't find "Mark [Bennet's] nifty
graph [...] (linked from his email)."  Can anyone help me locate what sounds
like a useful resource?

The JavaDoc (which Chris probably also wrote way back when) says hyperbolicTf is
a hyperbolic TANGENT function (http://www.dplot.com/fct_tanh.htm).  At least that
clarifies the basic shape, even if I (and apparently others, judging from the
yearly questions on the Lucene list) have yet to work out the full impact of all
the parameters and how hyperbolic tangent compares to the sqrt( freq + C ) of
the baseline, which I believe, if used with the defaults, degenerates to the
DefaultSimilarity.tf formula.
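
For anyone who wants numbers to feed into gnuplot or a spreadsheet, here is a
minimal sketch of the two curves in plain Java with no Lucene dependency.  It
assumes the formulas are as the JavaDoc describes them: a tanh with a
configurable base, shifted by an offset and rescaled between a min and a max,
and a baseline of sqrt(freq + base^2 - min).  The class name TfCurves is
hypothetical, and the constants min=0 and max=2 (hyperbolic) and base=0, min=0
(baseline) are assumptions to be checked against the source; only the base of
1.3 and the offset of 10 come from this message.

```java
// Hypothetical stand-alone sketch; not the Lucene class itself.
class TfCurves {

    // tanh with a configurable base, centered on xoffset and rescaled so the
    // result lies between min and max (assumed reading of the JavaDoc).
    static double hyperbolicTf(double freq, double min, double max,
                               double base, double xoffset) {
        double x = freq - xoffset;
        double up = Math.pow(base, x);
        double down = Math.pow(base, -x);
        double result = min + (max - min) / 2.0 * ((up - down) / (up + down) + 1.0);
        // For very large freq, up overflows to Infinity and the ratio becomes
        // NaN; the curve's limit there is max.
        return Double.isNaN(result) ? max : result;
    }

    // Baseline tf (assumed form): a constant up to min, then a shifted sqrt.
    // With base=0 and min=0 this is simply sqrt(freq).
    static double baselineTf(double freq, double base, double min) {
        return (freq <= min) ? base : Math.sqrt(freq + base * base - min);
    }

    public static void main(String[] args) {
        // Tab-separated columns, easy to plot.
        System.out.println("freq\tbaseline\thyperbolic");
        for (int freq = 0; freq <= 30; freq += 2) {
            System.out.printf("%d\t%.4f\t%.4f%n",
                    freq,
                    baselineTf(freq, 0.0, 0.0),
                    hyperbolicTf(freq, 0.0, 2.0, 1.3, 10.0));
        }
    }
}
```

Under these assumptions the baseline keeps growing with freq, while the
hyperbolic curve passes through the midpoint of min and max at the offset and
then flattens out just below max.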

Another problem: the e-mail thread Chris linked mentions "people who know the
'sweetspot' of their data", but I have yet to find a definition of what is
meant by "sweetspot", so I couldn't say whether I know my data's sweet spot
or not.

Another question is how to choose the tf_hyper_xoffset parameter.  It appears
to be the inflexion point of the tanh equation, but what term count should a
caller consider centering there (or consider placing in the approximate area
where the graph is "mostly" level)?  Or, more simply, why 10?
Any thoughts from anyone?

I also note that the JavaDoc says the default value of tf_hyper_base ("the base
value to be used in the exponential for the hyperbolic function") is e.  But
checking the code, the default is actually 1.3 (less than half of e).  Should I
file a doc bug?
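
One way to see what the base and the offset do together: the expression
(b^x - b^-x) / (b^x + b^-x) is identical to tanh(x * ln b), so the base only
stretches or squeezes the curve along the freq axis around the offset.  The
sketch below, again plain Java with a hypothetical class name and assuming the
JavaDoc's formula, computes how far either side of the offset the curve needs
to go to climb from 5% to 95% of its range.

```java
// Hypothetical helper, not part of Lucene: measures how wide the steep part of
// the hyperbolic tf curve is for a given base.
class HyperbolicTfWidth {

    // Distance from the offset at which the tanh term reaches `fraction` of
    // its limit (0 < fraction < 1).  Uses the identity
    //   (b^x - b^-x) / (b^x + b^-x) == tanh(x * ln b)
    // and solves tanh(x * ln b) = fraction for x.
    static double halfWidth(double base, double fraction) {
        double atanh = 0.5 * Math.log((1.0 + fraction) / (1.0 - fraction));
        return atanh / Math.log(base);
    }

    public static void main(String[] args) {
        double offset = 10.0; // the default offset questioned above
        double[] bases = {1.3, Math.E};
        for (double base : bases) {
            // tanh = -0.9 and +0.9 correspond to 5% and 95% of (max - min).
            double w = halfWidth(base, 0.9);
            System.out.printf(
                    "base=%.3f: tf climbs from 5%% to 95%% of its range "
                    + "between freq=%.2f and freq=%.2f%n",
                    base, offset - w, offset + w);
        }
    }
}
```

With base 1.3 the steep part spans roughly freq 4.4 to 15.6; with base e it
would span only about 8.5 to 11.5.  So the documented and the actual default
give noticeably different curves, and the offset marks the middle of the
transition rather than the level area.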

To summarize: Does anyone have any resources along the lines of graphs of these
(or any other) tf functions, general discussion of a document collection's sweet
spot, or any insight into the parameters of this class (hyperbolic tangent or
otherwise)?

-Paul


> 
> : > And I am aware that SweetSpotSimilarity resulted from this paper
> : >
> : > http://trec.nist.gov/pubs/trec16/papers/ibm-haifa.mq.final.pdf
> 
> For the record, that paper did not result in SSS -- I wrote SSS ~Dec 2005 and
> contributed it to Apache a few months later on behalf of CNET Networks where I
> developed it to solve some specific problems we had with product data...
> 
> https://issues.apache.org/jira/browse/LUCENE-577
> http://mail-archives.apache.org/mod_mbox/lucene-dev/200605.mbox/%3CF9F270C4-FA1E-460F-A54F-E2E56AAD0286%40rectangular.com%3E
> (and subsequent replies)
> 
> ...Doron wrote the paper later, although you'll note lots of discussions
> around that time on the mailing list about customizing Similarity based on
> domain specific data -- the concepts certainly weren't novel.
> 
> : > However, I was wondering if there was a resource that explained (and gave
> : > examples) of how SSS works and what each parameter (hyperbolic, etc) means.
> : > I know this is a Lucene list but I am actually
> 
> The functions are pretty clearly spelled out in the javadocs -- you just set
> the options on the class to control the constant values of the functions.
> The easiest way to understand them is probably to use something like gnuplot
> to graph them using various values for the constants, and then compare to
> graphs of the corresponding functions from DefaultSimilarity.
> 
> 
> 
> 
> -Hoss
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org

