[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834421#action_12834421 ]
Robert Muir edited comment on LUCENE-2091 at 2/16/10 8:09 PM:
--------------------------------------------------------------

Joaquin, have you seen this paper:
http://doc.rero.ch/lm.php?url=1000,43,4,20091218142456-GY/Dolamic_Ljiljana_-_When_Stopword_Lists_Make_the_Difference_20091218.pdf

It is interesting how they slightly modified BM25's idf formula to improve results when no stopword list is used.

I'm curious what you think about this, as it looks like a potential improvement for people not using stopwords (multilingual situations, etc.).

Edit: for simplicity, here is the quote:

{noformat}
Using the original idf formula idf = log[(n − df_j + 0.5)/(df_j + 0.5)], we have noticed that when the underlying term t_j occurs in more than half of the documents (df_j > n/2), the resulting idf value would be negative, and the final document score also could be negative. As a means of estimating idf, we therefore suggest a new variant defined as idf = log{1 + [(n − df_j + 0.5)/(df_j + 0.5)]}.
{noformat}

> Add BM25 Scoring to Lucene
> --------------------------
>
>                 Key: LUCENE-2091
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2091
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Yuval Feinstein
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime
> somewhat.
> I would like to contribute the code to Lucene under contrib.
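
To make the effect of the idf variant quoted in the comment above concrete, here is a minimal standalone sketch that evaluates both formulas side by side. It is not taken from the attached patch or from any Lucene API; the class name Bm25IdfSketch and the method names classicIdf and shiftedIdf are made up for illustration, and n and df stand for the collection size and the document frequency of a term.

```java
/**
 * Illustrative sketch only: compares the original BM25 idf with the
 * variant quoted from the paper. Names here are hypothetical and are
 * not part of Lucene or of the LUCENE-2091 patch.
 */
public final class Bm25IdfSketch {
    private Bm25IdfSketch() {}

    /** Original formula: log((n - df + 0.5) / (df + 0.5)). Negative once df > n/2. */
    public static double classicIdf(long n, long df) {
        return Math.log((n - df + 0.5) / (df + 0.5));
    }

    /** Quoted variant: log(1 + (n - df + 0.5) / (df + 0.5)). Positive for 0 <= df <= n. */
    public static double shiftedIdf(long n, long df) {
        return Math.log(1.0 + (n - df + 0.5) / (df + 0.5));
    }

    public static void main(String[] args) {
        long n = 1000; // assumed collection size, chosen only for the demo
        for (long df : new long[] {1, 100, 500, 501, 900, 1000}) {
            System.out.printf("df=%d classic=%.4f shifted=%.4f%n",
                    df, classicIdf(n, df), shiftedIdf(n, df));
        }
    }
}
```

With n = 1000, the original formula gives exactly 0 at df = 500 and turns negative from df = 501 onward, while the variant stays positive for every df. Algebraically the variant simplifies to log((n + 1)/(df + 0.5)), so very common terms (the would-be stopwords) still contribute a small positive weight instead of lowering the score.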