[ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754610#action_12754610 ]
Doron Cohen commented on LUCENE-1908:
-------------------------------------

{quote}
I'm still a little confused I guess
{quote}

That makes two of us... :)

{quote}
The longer docs will have larger weights naturally is what I meant, but larger weights actually hurts in the cosine normalization - so it actually over punishes I guess? I dunno - all of this over punish/under punish is in comparison to a relevancy curve they figure out (a probability of relevance as a function of document length), and how the pivoted cosine curves compare against it. I'm just reading across random interweb pdfs from google. Apparently our pivot also over punishes large docs and over favors small, the same as this one, but perhaps not as bad? I'm seeing that in a Lucene/Juru research pdf. This stuff is hard to grok on first pass.
{quote}

In that work we got the best results from Lucene (for TREC) with SweetSpot similarity and normalizing tf by the average term-freq in the doc. For me it was mainly experimental and intuitive, but I was not able to convince Hoss (or even convince myself once I read Hoss's comments) that this was justified theoretically. At that time I was not aware of the V(d) normalization delicacy we are discussing now. I think I understand things better now, and still I am not sure. Need to read http://nlp.stanford.edu/IR-book/html/htmledition/pivoted-normalized-document-length-1.html and sleep on it...

> Similarity javadocs for scoring function to relate more tightly to scoring models in effect
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1908
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1908.patch, LUCENE-1908.patch
>
>
> See discussion in the related issue.
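
For readers following the over punish/under punish thread in the comment above, here is a rough sketch of the pivoted length normalization idea from the IR-book page linked there. It is not Lucene's Similarity API and not the code from the Lucene/Juru work. The class name, method names, and the pivot and slope values are all made up for illustration, and "length" stands for whatever the unpivoted normalization factor is.

```java
// Sketch only: hypothetical names, not part of Lucene.
final class PivotedNormSketch {

    // Pivoted normalization factor: (1 - slope) * pivot + slope * length.
    // When length == pivot the factor is unchanged; with slope < 1,
    // shorter docs get a larger factor and longer docs a smaller one.
    static double pivotedNorm(double length, double pivot, double slope) {
        return (1.0 - slope) * pivot + slope * length;
    }

    // A score is divided by the factor, so the multiplier is its reciprocal.
    static double lengthMultiplier(double length, double pivot, double slope) {
        return 1.0 / pivotedNorm(length, pivot, slope);
    }

    public static void main(String[] args) {
        double pivot = 100.0; // assumed value, e.g. an average doc length
        double slope = 0.25;  // assumed value, would be tuned per collection
        for (double len : new double[] {10.0, 100.0, 1000.0}) {
            System.out.printf("len=%.0f plain=%.6f pivoted=%.6f%n",
                    len, 1.0 / len, lengthMultiplier(len, pivot, slope));
        }
    }
}
```

With these assumed values, a doc shorter than the pivot gets a smaller multiplier than plain 1/length, and a doc longer than the pivot gets a larger one, which is the direction of correction the quoted text is talking about.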