Terry, Otis, Clemens, One more try to correct the lucene-dev email address. Monday evening blues I guess...
>Ype, > >I couldn't find/open any attachment. Would you try to send it to me >directly? I'd very much like to read and help revise the document. Oh well, I changed email programs and it _said_ it would attach. Anyway, here is the html, it's not very clean, but it displays nicely here. Sorry for the teaser, Ype <!doctype html public "-//w3c//dtd html 4.0 transitional//en"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> <meta name="GENERATOR" content="Mozilla/4.79 (Macintosh; U; PPC) [Netscape]"> <title>lcnscoring.html</title> </head> <body> <h1> Scoring in Lucene</h1> <h2> Introduction</h2> This document explores the scoring capabilities of the Lucene search engine. The target audience is primarily users of Lucene. The intention is to help these users understand the default document scores computed by Lucene for their queries, and to show how the default scoring mechanism can be adapted. <p>This document describes the scoring mechanism of Lucene 1.3dev. This mechanism is available as the Java class Similarity in package org.apache.lucene.search from the Lucene web site http://jakarta.apache.org/lucene, revision 1.2, 29 Jan 2003. <h2> Basic operation of Lucene</h2> In Lucene, a document consists of fields, and a field consists of terms. To execute queries, Lucene analyzes and indexes documents. <h3> Indexing and querying</h3> During indexing, an analyzer software component extracts terms from each document. The analyzer normally splits the documents into fields and words and removes stop words. For each field, these words are then stored in a Lucene index as terms. <p>Once the index is ready, it can be used for querying. A query consists of terms and phrases and results in a list of scored documents. <p>During query term preprocessing, the query terms are normally analyzed by the same analyzer that was used to build the index. At this point, weights for terms and phrases are established.
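<p>The analysis step described above can be sketched in plain Java. This is a minimal illustration only: the class name, the stop list, and the tokenization rule are all made up for the example, and Lucene's real Analyzer classes are considerably more elaborate.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the analysis step: split a field's text into
// words, normalize case, and remove stop words. The surviving words are
// the terms that would be stored in the index for that field.
class AnalyzerSketch {
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "the", "of", "to"));

    public static List<String> analyze(String fieldText) {
        List<String> terms = new ArrayList<>();
        for (String word : fieldText.toLowerCase().split("[^a-z0-9]+")) {
            if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                terms.add(word); // kept as an index term
            }
        }
        return terms;
    }
}
```

Because the same analyzer is applied to query text during query term preprocessing, a query and a document agree on what counts as a term.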
<p>During query search, the indexes and the weights are used to assign a score to each document. <h3> Term-based scoring</h3> The scoring part of Lucene is called the Similarity interface, because it determines how similar a document is to a query. It is also referred to as the 'custom scoring API'. <p>Lucene has a default implementation of this Similarity interface. By using another implementation, the scoring can be changed. Such a change requires only a straightforward program adaptation. This document indicates which Java methods of the Similarity interface can be changed. Each of these Java methods represents a part of the scoring mechanism. <p>For scoring, the following aspects of the Lucene query language are important: fields, terms, phrases, and query weights. <p>The default scoring method is based on a well-established scoring mechanism for simple terms. It uses the logarithmic inverse document frequency for the term. <p>(Check: give a reference?) <p>Other query elements are truncated and imprecise terms and phrases. Their scoring is done by working back to this well-established term scoring mechanism. <h3> Scoring of truncation and imprecision</h3> The Lucene query language allows truncation and imprecise queries (the ~ operator for terms and phrases); their influence on scoring is currently not completely known to the author. <h2> Field weighting during indexing</h2> For each document, field weights that depend on field length and field name must be set at indexing time. The default field weight is the inverse square root of the number of terms the field contains for the document. <p>This field weighting can be used to give the fields of some documents an advantage over other documents, e.g. as a result of citation analysis. <p>Field weighting can also be used to enforce a minimum length for short fields (e.g. titles) in order to prevent these from scoring high only because of their short length.
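<p>The default field weight stated above (the inverse square root of the number of terms in the field) can be sketched as follows. The class name is illustrative, not Lucene's actual API; like the default, this sketch ignores the field name and depends only on the field length.

```java
// Sketch of the default lengthNorm described above.
class LengthNormSketch {
    public static float lengthNorm(String fieldName, int numberOfTerms) {
        // inverse square root of the number of terms in the field:
        // a 4-term title gets weight 0.5, a 100-term body gets weight 0.1
        return (float) (1.0 / Math.sqrt(numberOfTerms));
    }
}
```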
<p>Another use is to provide a higher weight for fields with a priori known higher relevance. <p>Since field weights can also be adapted in queries, field weighting during indexing is most useful to distinguish documents from each other. <p>Field weights in the index have about 1 decimal digit (3 bits) of precision; they are stored as a single byte for each field of each document. <p>Changes in the document field weights require that all documents are reindexed. <p>Java method: lengthNorm(fieldName, numberOfTerms) <h2> Query term preprocessing</h2> Initial query processing retrieves the document frequencies of the terms in the query. These are combined with the query weights to form the term and phrase weights used during query search. <h3> Query term weight and query phrase weight</h3> Since this is part of the query language, there is no corresponding Java method in the Lucene Similarity API. <p>The term or phrase weight given in the query (default 1) is the first term weighting factor. It can be used, among other things, to compensate for unwanted effects of the other term weighting described below. <p>The Lucene query language makes it possible to require the presence of a query term or phrase in a document field. Higher term or phrase frequencies within scored documents can e.g. be obtained by using a higher query weight. <h3> Inverse document frequency of a term</h3> Another weight for a term within a query can depend on its document frequency (the number of documents in which the term occurs) and the total number of documents taking part in the search. The default for this weight is (1 + log(numDocs/(docFreq + 1))), i.e. a term's score is lower when more documents contain the term. <p>Check: document frequencies of truncated query terms and imprecise query terms. <p>Java method: idf(docFreq, numDocs); idf stands for 'inverse document frequency'.
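<p>The default idf formula above can be sketched directly; again the class name is illustrative, not Lucene's actual API.

```java
// Sketch of the default idf described above: 1 + log(numDocs / (docFreq + 1)).
// The more documents a term occurs in, the lower its weight.
class IdfSketch {
    public static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }
}
```

For example, with 100 documents in the index, a term occurring in 9 of them gets weight 1 + log(10), while a term occurring in 50 of them gets a lower weight.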
<h3> Inverse document frequency of a phrase</h3> <p>For a phrase, the document frequency is not available before the query is evaluated. By default, the inverse document frequencies of the individual terms in the phrase are summed to provide a phrase weighting factor. <p>Java method: idf(terms, searcher) <br>The searcher here is the Java object that executes the query. <h3> Query norm</h3> To make scores from different queries comparable, a query norm function is used, which is given the sum of the squared weights of all the query terms. This function does not affect the ranking order for a single query. By default, this function is the inverse of the square root. <p>Java method: queryNorm(sumOfSquaredWeights) <h2> Query search</h2> For each document that satisfies the query, the search extracts the following information from the indexes. Here 'field' is used in the sense of a field of the document being searched. <ul> <li>field weight of the field in which a query term or phrase occurs,</li> <li>query term frequency within the field,</li> <li>query phrase frequency within the field,</li> <li>edit distance for imprecise query terms and query phrases within the field,</li> <li>the number of different query terms within the document.</li> </ul> <h3> Term or phrase frequency in a document</h3> The frequency of a term or phrase within a document field is available for scoring. By default, the square root of this frequency is used. <p>Java method: tf(frequency) <h3> Imprecise occurrences</h3> For imprecise phrase matches, the 'edit distance' to the phrase is also available for scoring. The edit distance is a measure of how imprecise the match is. <p>It is used to compute the contribution of the match to the total frequency of the phrase in the document field. By default, this 'sloppy frequency contribution' is 1/(distance + 1). <p>The precise meaning of 'edit distance' needs further investigation.
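<p>The three defaults described in this and the preceding section — the query norm, the term frequency factor, and the sloppy frequency contribution — can be sketched together. The class name is illustrative, not Lucene's actual API.

```java
// Sketches of the default implementations described above.
class ScoringDefaultsSketch {
    // queryNorm: inverse square root of the sum of squared term weights
    public static double queryNorm(double sumOfSquaredWeights) {
        return 1.0 / Math.sqrt(sumOfSquaredWeights);
    }

    // tf: square root of the term (or phrase) frequency within a field,
    // so repeated occurrences help, but with diminishing returns
    public static double tf(double frequency) {
        return Math.sqrt(frequency);
    }

    // sloppyFreq: contribution of one imprecise match with the given
    // edit distance; an exact match (distance 0) contributes 1
    public static double sloppyFreq(int distance) {
        return 1.0 / (distance + 1);
    }
}
```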
<p>Check: <br>For phrases, the edit distance is computed using term proximity information from the index. <br>For terms, the edit distance is the minimum number of single-character edits (modify, insert, delete) between the query term and the occurring term. <p>Java method: sloppyFreq(distance) <h3> Query document overlap</h3> The number of different query terms that a document contains (i.e. the overlap) and the number of terms in the query are used for another factor indicating how well the document matches the query as a whole. This makes it possible to take into account the number of different non-required terms occurring in the document. <p>Check: As truncated query terms are equivalent to the OR of all matching terms in the index, truncation can result in a large maxOverlap. <p>By default, this factor is (overlap / nrQueryTerms). (The API documentation uses maxOverlap for nrQueryTerms; to be investigated.) <p>Java method: coord(overlap, maxOverlap) <h2> Scoring formulas</h2> The following formulas determine how the document score for a query is computed. <h3> Query preprocessing</h3> <table BORDER COLS=3 WIDTH="100%" > <tr> <td>numDocs</td> <td> </td> <td>The number of documents in the database, from the index</td> </tr> <tr> <td>docFreq</td> <td> </td> <td>The number of documents in which a query term occurs, from the index</td> </tr> <tr> <td>qtw</td> <td> </td> <td>The query weight of a term or phrase, from the query</td> </tr> <tr> <td>tw</td> <td>qtw * idf(docFreq, numDocs)</td> <td>Weight of a term in the query</td> </tr> <tr> <td>tw</td> <td>qtw * idf(terms, searcher)</td> <td>Weight of a phrase in the query</td> </tr> <tr> <td>qn</td> <td>queryNorm(SUM(tw * tw))</td> <td>The query norm, summing over all terms and phrases in the query</td> </tr> </table> <h3> Query search</h3> During search, the actual occurrences of terms and phrases in the document are taken into account. Here 'field' is used in the sense of a field of the document being scored.
'Occurrence' is used in the sense of an occurrence in a field. <table BORDER COLS=3 WIDTH="100%" > <tr> <td>freq</td> <td> </td> <td>Number of times a term occurs in a field, from the index.</td> </tr> <tr> <td>distance</td> <td> </td> <td>See 'Imprecise occurrences'.</td> </tr> <tr> <td>overlap</td> <td> </td> <td>See 'Query document overlap'.</td> </tr> <tr> <td>maxOverlap</td> <td> </td> <td>See 'Query document overlap'.</td> </tr> <tr> <td>fw</td> <td>lengthNorm(fieldName, numberOfTerms)</td> <td>Field weight of the field of an occurrence</td> </tr> <tr> <td>tfs</td> <td>tw * fw * tf(freq)</td> <td>Score of a term in a field</td> </tr> <tr> <td>ssf</td> <td>SUM(sloppyFreq(distance))</td> <td>'Frequency' of imprecise occurrences</td> </tr> <tr> <td>tfs</td> <td>tw * fw * tf(ssf)</td> <td>Score of imprecise occurrences</td> </tr> <tr> <td>tds</td> <td>SUM(tfs)</td> <td>Total score of occurrences</td> </tr> <tr> <td>crd</td> <td>coord(overlap, maxOverlap)</td> <td>Query document overlap</td> </tr> <tr> <td>docscore</td> <td>qn * tds * crd</td> <td>Document score for the query</td> </tr> </table> </body> </html>
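Putting the formulas above together, the score of a single-term query against one document field can be sketched end to end. This is an illustration of the default formulas only, not Lucene's actual implementation, and all the numbers in the example are made up.

```java
// End-to-end sketch of the document score formulas above, for a query
// with a single term matching a single field of one document.
class DocScoreSketch {
    static double idf(int docFreq, int numDocs) { return 1.0 + Math.log((double) numDocs / (docFreq + 1)); }
    static double queryNorm(double sumOfSquaredWeights) { return 1.0 / Math.sqrt(sumOfSquaredWeights); }
    static double tf(double freq) { return Math.sqrt(freq); }
    static double lengthNorm(int numberOfTerms) { return 1.0 / Math.sqrt(numberOfTerms); }
    static double coord(int overlap, int maxOverlap) { return (double) overlap / maxOverlap; }

    public static double score(double qtw, int docFreq, int numDocs,
                               int freq, int fieldTerms,
                               int overlap, int maxOverlap) {
        double tw = qtw * idf(docFreq, numDocs);      // weight of the term in the query
        double qn = queryNorm(tw * tw);               // single term, so SUM(tw * tw) = tw * tw
        double fw = lengthNorm(fieldTerms);           // field weight of the matching field
        double tfs = tw * fw * tf(freq);              // score of the term in the field
        double tds = tfs;                             // only one occurrence to sum
        return qn * tds * coord(overlap, maxOverlap); // docscore = qn * tds * crd
    }
}
```

With a single query term, the query norm cancels the term weight, so the score reduces to fw * tf(freq) * crd; e.g. query weight 1, docFreq 9 out of 100 documents, frequency 4 in a 16-term field, and full overlap give a score of 0.25 * 2 * 1 = 0.5.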
