[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12785840#action_12785840 ]
Joaquin Perez-Iglesias edited comment on LUCENE-2091 at 12/4/09 10:19 AM:
--------------------------------------------------------------------------

Yes, sorry. Basically, what we are trying to do is constrain the effect of the raw frequency (saturate the frequency). In Lucene this is done by taking the square root of the frequency; another classical approach is to use the log. Both approaches avoid giving a linear 'importance' to the frequency. BM25 is a bit trickier: it parametrises the 'saturation' of the frequency with a parameter k1, using the equation weight(t)/(weight(t)+k1). Usually k1 is fixed to 2, but it can be set per collection.

(Uwe) Regarding the IDF issue, I believe the more correct approach (in theoretical terms) would be to use the docFreq over the fields the user wants to search, but I don't think this can be done. For example, if we have indexed three fields, F1, F2 and F3, and the user wants to search on F1 and F2, there is no way to compute docFreq over both fields together. With a catch-all field we have docFreq over all fields. So maybe the best available approach would be to use IDF per field. What do you think?
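
To make the saturation point concrete, here is a minimal sketch comparing the three damping functions mentioned above. The function names are hypothetical, the `1 + log(freq)` form is just one common log variant, and the raw frequency is passed directly as weight(t) for illustration only (in BM25 proper the weight is usually a length-normalised frequency).

```python
import math


def sqrt_tf(freq):
    """Square-root damping, the approach the comment attributes to Lucene."""
    return math.sqrt(freq)


def log_tf(freq):
    """Logarithmic damping; the 1 + log(freq) form is an assumed variant."""
    return 1.0 + math.log(freq) if freq > 0 else 0.0


def bm25_saturation(weight, k1=2.0):
    """weight / (weight + k1), as quoted in the comment.

    The result grows with weight but never reaches 1, and k1 controls
    how quickly it approaches that ceiling.
    """
    return weight / (weight + k1)


if __name__ == "__main__":
    # Assumption: raw frequency stands in for weight(t) in this toy comparison.
    for freq in (1, 2, 5, 10, 100):
        print(freq,
              round(sqrt_tf(freq), 3),
              round(log_tf(freq), 3),
              round(bm25_saturation(freq), 3))
```

Unlike the square root and the log, which keep growing without bound, the BM25 term is bounded, so very frequent terms cannot dominate the score; a smaller k1 makes the curve flatten sooner.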

> Add BM25 Scoring to Lucene
> --------------------------
>
>                 Key: LUCENE-2091
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2091
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Yuval Feinstein
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime
> somewhat.
> I would like to contribute the code to Lucene under contrib.