Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by MichaelStack:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  
  ==== How is scoring done in Nutch? (Or, explain the "explain" page?) ====
  
- Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does 
it. The formula Lucene uses scoring can be found at the head of the Lucene 
Similarity class in the 
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
 Lucene Similarity Javadoc]. Roughly, the score for a particular document in a 
set of query results, "score(q,d)", is the sum of the score for each term of a 
query ("t in q"). A terms score in a document is itself the sum of the term run 
against each field that comprises a document ("title" is one field, "url" 
another. A "document" is a set of "fields"). Per field, the score is the 
product of the following factors: Its "td" (term freqency in the document), a 
score factor "idf" (usually a factor made up of frequency of term relative to 
amount of docs in index), an index-time boost, a normalization of count of 
terms found relative to size of document ("lengthNorm"), a similar 
normalization is done for the term in the query itself ("queryNorm"), and 
finally, a factor with a weight for how many 
instances of the total amount of terms a particular document contains. Study 
the lucene javadoc to get more detail on each of the equation components and 
how they effect overall score.
+ Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does 
it. The formula Lucene uses for scoring can be found at the head of the Lucene 
Similarity class in the 
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
 Lucene Similarity Javadoc]. Lucene scoring looks to be based on the Vector 
Space Model of Information Retrieval. Roughly, the score for a particular 
document in a set of query results, "score(q,d)", is the sum of the scores for 
each term of the query ("t in q"). A term's score in a document is itself the 
sum of the scores of that term run against each field that comprises the 
document ("title" is one field, "url" another; a "document" is a set of 
"fields"). Per field, the score is the product of the following factors: its 
"tf" (term frequency in the document), a score factor "idf" (a factor made up 
of the frequency of the term relative to the number of docs in the index), an 
index-time boost, a normalization of the count of terms found relative to the 
size of the document ("lengthNorm"), a similar normalization for the term in 
the query itself ("queryNorm"), and finally "coord", a factor that weights how 
many of the query's terms a particular document contains. Study the Lucene 
Javadoc to get more detail on each of the equation components and how they 
affect the overall score.
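
As a rough illustration of the sum-of-products shape described above, here is a minimal Java sketch. It is not Lucene or Nutch code: the class ScoreSketch, the FieldMatch holder and every numeric value are hypothetical, and the sketch assumes "coord" and "queryNorm" are constant for a given query and document, so they are applied once outside the sums.

```java
import java.util.List;

/** Hypothetical illustration of the scoring shape; not Lucene's implementation. */
final class ScoreSketch {

    /** Per-field factors for one query term (all names are illustrative). */
    static final class FieldMatch {
        final double tf, idf, boost, lengthNorm;

        FieldMatch(double tf, double idf, double boost, double lengthNorm) {
            this.tf = tf;
            this.idf = idf;
            this.boost = boost;
            this.lengthNorm = lengthNorm;
        }

        /** Product of the per-field factors. */
        double score() {
            return tf * idf * boost * lengthNorm;
        }
    }

    /** Score of one query term: the sum over every field it was found in. */
    static double termScore(List<FieldMatch> fields) {
        double sum = 0.0;
        for (FieldMatch f : fields) {
            sum += f.score();
        }
        return sum;
    }

    /** score(q,d): the sum of the term scores, scaled by coord and queryNorm. */
    static double score(List<List<FieldMatch>> terms, double coord, double queryNorm) {
        double sum = 0.0;
        for (List<FieldMatch> term : terms) {
            sum += termScore(term);
        }
        return sum * coord * queryNorm;
    }

    public static void main(String[] args) {
        // Made-up factors: the first term matches two fields, the second matches one.
        List<FieldMatch> first = List.of(
                new FieldMatch(2.0, 1.5, 1.0, 0.5),
                new FieldMatch(1.0, 1.5, 2.0, 0.25));
        List<FieldMatch> second = List.of(new FieldMatch(1.0, 2.0, 1.0, 0.5));
        System.out.println(score(List.of(first, second), 1.0, 0.5));
    }
}
```

Real values for each factor come from the Similarity implementation in use; the sketch only shows how they combine.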
  
  To interpret the Nutch "explain.jsp" page, keep the Lucene scoring equation 
cited above in mind. First, notice how the output moves right as it moves from 
"score total", to "score per query term", to "score per document field" (a 
document field is not shown if the term was not found in that field). Next, 
look at the scoring of a particular field: it comprises a query component and 
a field component. The query component includes the query-time (as opposed to 
index-time) boost, an "idf" that is the same for the query and field 
components, and a "queryNorm". The field component is similar ("fieldNorm" is 
an aggregation of certain of the Lucene equation components).
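
The nesting described above can be pictured with a small Java sketch. It is only an assumption-laden stand-in: the Node class, the labels and the numbers are hypothetical, and the rendering is not the actual output format of "explain.jsp". Each level is indented one step further right than its parent, from total score, to query term, to document field, to the query and field components.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a nested score explanation; not Nutch's code. */
final class ExplainTreeSketch {

    /** One line of the explanation plus the details it is built from. */
    static final class Node {
        final double value;
        final String description;
        final List<Node> details = new ArrayList<>();

        Node(double value, String description) {
            this.value = value;
            this.description = description;
        }

        /** Adds a child detail and returns this node so calls can be chained. */
        Node add(Node child) {
            details.add(child);
            return this;
        }
    }

    /** Renders the tree, indenting each level two spaces further right. */
    static String render(Node node, int depth) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            sb.append("  ");
        }
        sb.append(node.value).append(" = ").append(node.description).append('\n');
        for (Node detail : node.details) {
            sb.append(render(detail, depth + 1));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up values: one term found in one field, whose score is the
        // product of a query component and a field component.
        Node field = new Node(1.5, "score per document field")
                .add(new Node(0.5, "query component"))
                .add(new Node(3.0, "field component"));
        Node term = new Node(1.5, "score per query term").add(field);
        Node total = new Node(1.5, "score total").add(term);
        System.out.print(render(total, 0));
    }
}
```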
  


