[Nutch Wiki] Update of "FAQ" by DanielNaber

Apache Wiki Sun, 16 Dec 2007 08:58:57 -0800

The following page has been changed by DanielNaber:
http://wiki.apache.org/nutch/FAQ

The comment on the change is:
mention OPICScoringFilter

------------------------------------------------------------------------------
  You can tweak your conf/common-terms.utf8 file after creating an index 
through the following command:
    bin/nutch org.apache.nutch.indexer.HighFreqTerms -count 10 -nofreqs index
  
- ==== What ranking algorithm is used in searches ? Does Nutch use the 
[http://en.wikipedia.org/wiki/HITS_algorithm HITS algorithm] ? ====
- 
- N/A yet
- 
  ==== How is scoring done in Nutch? (Or, explain the "explain" page?) ====
  
- Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does 
it. The formula Lucene uses scoring can be found at the head of the Lucene 
Similarity class in the 
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
 Lucene Similarity Javadoc]. Roughly, the score for a particular document in a 
set of query results, "score(q,d)", is the sum of the score for each term of a 
query ("t in q"). A terms score in a document is itself the sum of the term run 
against each field that comprises a document ("title" is one field, "url" 
another. A "document" is a set of "fields"). Per field, the score is the 
product of the following factors: Its "td" (term freqency in the document), a 
score factor "idf" (usually a factor made up of frequency of term relative to 
amount of docs in index), an index-time boost, a normalization of count of 
terms found relative to size of document ("lengthNorm"), a similar 
normalization is done for the term in the query itself ("queryNorm"), and 
finally, a factor with a weight for how many 
instances of the total amount of terms a particular document contains. Study 
the lucene javadoc to get more detail on each of the equation components and 
how they effect overall score.
+ Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does 
it. The formula Lucene uses for scoring can be found at the head of the Lucene 
Similarity class in the 
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
 Lucene Similarity Javadoc]. Roughly, the score for a particular document in a 
set of query results, "score(q,d)", is the sum of the scores for each term of 
the query ("t in q"). A term's score in a document is itself the sum of the 
term's scores against each field of the document ("title" is one field, "url" 
another; a "document" is a set of "fields"). Per field, the score is the 
product of the following factors: its "tf" (term frequency in the document), a 
score factor "idf" (usually derived from the frequency of the term relative to 
the number of docs in the index), an index-time boost, a normalization of the 
count of terms found relative to the size of the document ("lengthNorm"), a 
similar normalization for the term in the query itself ("queryNorm"), and 
finally a coordination factor ("coord") weighting how many of the query's terms 
a particular document contains. Study the Lucene Javadoc for more detail on 
each of the equation components and how they affect the overall score.
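
The factors described above can be sketched in a few lines of Java. This is only a minimal illustration, not the code Lucene or Nutch actually runs: the class and method names are hypothetical, and the concrete formulas (square-root "tf", logarithmic "idf", inverse-square-root "lengthNorm") are assumed from Lucene's default Similarity of that period. The Similarity Javadoc remains the authoritative definition.

```java
public class ScoreSketch {
    // Assumed default: tf grows with the square root of the raw term count.
    public static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Assumed default: rarer terms (small docFreq) get a larger idf.
    public static double idf(int docFreq, int numDocs) {
        return Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    }

    // Assumed default: longer fields are penalized.
    public static double lengthNorm(int numTermsInField) {
        return 1.0 / Math.sqrt(numTermsInField);
    }

    // Score of one term against one field. idf appears twice because it
    // contributes to both the query component and the field component.
    public static double fieldScore(int freq, int docFreq, int numDocs,
                                    double boost, int numTermsInField) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * boost * lengthNorm(numTermsInField);
    }

    // score(q,d): sum the per-term, per-field scores, then apply the
    // query normalization and the coordination factor.
    public static double score(double[] fieldScores, double queryNorm,
                               double coord) {
        double sum = 0.0;
        for (double s : fieldScores) {
            sum += s;
        }
        return coord * queryNorm * sum;
    }

    public static void main(String[] args) {
        // Illustrative numbers only: one term found in two fields.
        double title = fieldScore(1, 50, 1000, 1.5, 8);
        double url = fieldScore(2, 20, 1000, 4.0, 5);
        System.out.println(score(new double[] {title, url}, 0.1, 1.0));
    }
}
```

The boosts passed to fieldScore in main are stand-ins for whatever boosts apply in a real index or query; only the shape of the calculation is meant to carry over.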
  
  To interpret the Nutch "explain.jsp" page, keep the above-cited Lucene 
scoring equation in mind. First, notice how we move right as we move from 
"score total", to "score per query term", to "score per document field" 
(a document field is not shown if the term was not found in that field). 
Next, look at the scoring of a particular field: it comprises a query component 
and a field component. The query component includes the query-time (as opposed 
to index-time) boost, an "idf" that is the same for the query and field 
components, and a "queryNorm". The field component is similar 
("fieldNorm" is an aggregation of several of the Lucene equation components).
  
  ==== How can I influence Nutch scoring? ====
  
+ Scoring is implemented as a filter plugin, i.e. an implementation of the 
!ScoringFilter interface. By default, 
[http://lucene.apache.org/nutch/apidocs-0.8.x/org/apache/nutch/scoring/opic/OPICScoringFilter.html
 OPICScoringFilter] is used.
+ 
- The easiest way to influence scoring is to change query time boosts (Will 
require edit of nutch-site.xml and redeploy of the WAR file). Query-time boost 
by default looks like this:{{{
+ However, the easiest way to influence scoring is to change the query-time 
boosts (this requires editing nutch-site.xml and redeploying the WAR file). By 
default, the query-time boosts look like this:{{{
    query.url.boost, 4.0f
    query.anchor.boost, 2.0f
    query.title.boost, 1.5f
