[Nutch Wiki] Update of "FAQ" by MichaelStack

Apache Wiki Mon, 21 Nov 2005 13:17:35 -0800

The following page has been changed by MichaelStack:
http://wiki.apache.org/nutch/FAQ

The comment on the change is:
First cut at explanation of default scoring in Nutch

------------------------------------------------------------------------------
  
  ==== How is scoring done in Nutch? (Or, explain the "explain" page?) ====
  
+ Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does
it. The formula Lucene uses for scoring can be found at the head of the Lucene
Similarity class in the
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
 Lucene Similarity Javadoc]. Roughly, the score for a particular document in a
set of query results, "score(q,d)", is the sum of the scores for each term of
the query ("t in q"). A term's score in a document is itself the sum of the
scores of that term run against each field that makes up the document ("title"
is one field, "url" another; a "document" is a set of "fields"). Per field, the
score is the product of the following factors: "tf" (the term frequency in the
document), "idf" (a score factor usually derived from the number of documents
containing the term relative to the number of documents in the index), an
index-time boost, "lengthNorm" (a normalization of the count of terms found
relative to the size of the document), "queryNorm" (a similar normalization
applied to the query itself), and finally "coord", a factor that weights how
many of the query's terms a particular document contains. Study the Lucene
Javadoc for more detail on each component of the equation and how it affects
the overall score.
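+ 
+ As a reading aid, the factors listed above can be written out as a single
expression. This is a sketch assembled from the description in this answer, not
a copy of the Javadoc; the labels "boost" and "coord" are names chosen here for
the index-time boost and the term-overlap factor, and the Lucene Similarity
Javadoc remains the authoritative statement of the formula.

```latex
\[
\mathrm{score}(q,d) = \sum_{t \in q}
    \mathrm{tf}(t \text{ in } d) \cdot
    \mathrm{idf}(t) \cdot
    \mathrm{boost}(t.\mathrm{field} \text{ in } d) \cdot
    \mathrm{lengthNorm}(t.\mathrm{field} \text{ in } d) \cdot
    \mathrm{coord}(q,d) \cdot
    \mathrm{queryNorm}(q)
\]
```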
+ 
+ To interpret the Nutch "explain.jsp" page, keep the Lucene scoring equation
cited above in mind. First, notice how the breakdown moves right as it goes
from "score total", to "score per query term", to "score per document field"
(a document field is not shown if the term was not found in that field). Next,
look at the scoring for a particular field: it comprises a query component and
then a field component. The query component includes the query-time (as opposed
to index-time) boost, an "idf" that is the same for the query and field
components, and a "queryNorm". The field component is broken down similarly
("fieldNorm" is an aggregation of certain of the Lucene equation components).
+ 
+ ==== How can I influence Nutch scoring? ====
+ 
+ The easiest way to influence scoring is to change the query-time boosts (this
requires editing nutch-site.xml and redeploying the WAR file). The default
query-time boosts are:
+ 
+   query.url.boost, 4.0f
+   query.anchor.boost, 2.0f
+   query.title.boost, 1.5f
+   query.host.boost, 2.0f
+   query.phrase.boost, 1.0f
+ 
+ From the list above, you can see that terms found in a document's URL get the
highest boost, with anchor text and host next, and so on.
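+ 
+ To make the effect of these weights concrete, here is a toy sketch in Java. It
is not Nutch or Lucene code: the class name, the method name, and the unboosted
per-field scores are all hypothetical, and it models only the multiplication of
a field's score by its query-time boost, assuming the default values listed
above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy illustration of query-time field boosts; not Nutch or Lucene code. */
class QueryBoostSketch {

    /** Default per-field boosts copied from the list above (phrase boost omitted: it is not a field). */
    static final Map<String, Float> DEFAULT_BOOSTS = new LinkedHashMap<>();
    static {
        DEFAULT_BOOSTS.put("url", 4.0f);
        DEFAULT_BOOSTS.put("anchor", 2.0f);
        DEFAULT_BOOSTS.put("title", 1.5f);
        DEFAULT_BOOSTS.put("host", 2.0f);
    }

    /**
     * Sums the boosted per-field scores for one query term.
     * rawFieldScores maps a field name to a hypothetical unboosted score;
     * a field with no entry in boosts is weighted 1.0.
     */
    static float boostedTermScore(Map<String, Float> rawFieldScores,
                                  Map<String, Float> boosts) {
        float total = 0.0f;
        for (Map.Entry<String, Float> e : rawFieldScores.entrySet()) {
            float boost = boosts.getOrDefault(e.getKey(), 1.0f);
            total += boost * e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        // Same unboosted score (0.5) for a hit in the URL and a hit in the title.
        Map<String, Float> urlHit = new LinkedHashMap<>();
        urlHit.put("url", 0.5f);
        Map<String, Float> titleHit = new LinkedHashMap<>();
        titleHit.put("title", 0.5f);

        System.out.println("url match:   " + boostedTermScore(urlHit, DEFAULT_BOOSTS));
        System.out.println("title match: " + boostedTermScore(titleHit, DEFAULT_BOOSTS));
    }
}
```

With equal unboosted scores of 0.5, the URL match comes out at 2.0 and the title
match at 0.75, which is the ordering the default boosts are meant to produce.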
+ 
+ Anchor text makes a large contribution to a document's score (you can see the
anchor text for a page by browsing to "explain", then editing the URL to put
"anchors.jsp" in place of "explain.jsp").
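+ 
+ As a sketch of what a query-time boost override in nutch-site.xml might look
like, the fragment below raises the title boost. The value 3.0 is an arbitrary
example, and the fragment assumes the usual property/name/value layout of the
Nutch configuration files; it belongs inside the root element that your
nutch-site.xml already has. Check nutch-default.xml in your release for the
exact property names and defaults.

```xml
<!-- Hypothetical override: raise the title boost from its default of 1.5. -->
<property>
  <name>query.title.boost</name>
  <value>3.0</value>
</property>
```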
+ 
+ 
