[Nutch-cvs] [Nutch Wiki] Update of "FAQ" by Sergey

Apache Wiki Sat, 02 Sep 2006 06:38:12 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The following page has been changed by Sergey:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  ==== What happens if I inject urls several times? ====
  
  Urls, which are already in the database, won't be injected.
- 
- ==== Java.io.IOException: No input directories specified in: NutchConf: 
nutch-default.xml , mapred-default.xml ====
- 
- This really is a crawl tool issue, but is covered here as well. The crawl 
tool expects as its first parameter the name of the folder where the seed urls 
file is located. For example, if your urls.txt is located in /nutch/seeds, the 
crawl command would look like: crawl seeds -dir /user/nutchuser...
  
  === Fetching ===
  
@@ -166, +162 @@

  ==== While fetching I get UnknownHostException for known hosts ====
  
  Make sure your DNS server is working and/or it can handle the load of 
requests.
- 
- 
- ==== It seems as if not all links are followed in the pages in my URL lists 
====
- 
- 1.) Make sure that your expressions in conf/crawl-urlfilter.txt are correct; 
perhaps the links are dropped there.
- 
- 2.) Make sure that in conf/nutch-site.xml the following parameters are set 
appropriately:
- 
- * http.content.limit: otherwise some content may never be fetched at all
- * db.max.outlinks.per.page: otherwise the links might be dropped.
- 
- 3.) Make sure you have the parse-js and all other necessary plugins active in 
conf/nutch-site.xml
  
  === Updating ===
  
@@ -351, +335 @@

  
  ==== How is scoring done in Nutch? (Or, explain the "explain" page?) ====
  
- Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does 
it. The formula Lucene uses for scoring can be found at the head of the Lucene 
Similarity class in the 
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
 Lucene Similarity Javadoc].  Lucene scoring looks to be based on the Vector 
Space Model of Information Retrieval science.  Roughly, the score for a 
particular document in a set of query results, "score(q,d)", is the sum of the 
scores for each term of a query ("t in q"). A term's score in a document is 
itself the sum of the term's scores against each field that makes up the 
document ("title" is one field, "url" another; a "document" is a set of 
"fields"). Per field, the score is the product of the following factors: its 
"tf" (term frequency in the document), a score factor "idf" (a factor made up 
of frequency of term relative to amount of docs in index), an index-time boost, 
a normalization of the count of terms found relative to the size of the 
document ("lengthNorm"), a similar normalization for the term in the query 
itself ("queryNorm"), and finally, a factor that weights how many of the 
query's terms a particular document contains ("coord"). Study the Lucene 
Javadoc to get more detail on each of the equation components and how they 
affect the overall score.
+ Nutch is built on Lucene. To understand Nutch scoring, study how Lucene does 
it. The formula Lucene uses for scoring can be found at the head of the Lucene 
Similarity class in the 
[http://lucene.apache.org/java/docs/api/org/apache/lucene/search/Similarity.html
 Lucene Similarity Javadoc]. Roughly, the score for a particular document in a 
set of query results, "score(q,d)", is the sum of the scores for each term of a 
query ("t in q"). A term's score in a document is itself the sum of the term's 
scores against each field that makes up the document ("title" is one field, 
"url" another; a "document" is a set of "fields"). Per field, the score is the 
product of the following factors: its "tf" (term frequency in the document), a 
score factor "idf" (usually a factor made up of frequency of term relative to 
amount of docs in index), an index-time boost, a normalization of the count of 
terms found relative to the size of the document ("lengthNorm"), a similar 
normalization for the term in the query itself ("queryNorm"), and finally, a 
factor that weights how many of the query's terms a particular document 
contains ("coord"). Study the Lucene Javadoc to get more detail on each of the 
equation components and how they affect the overall score.
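
For orientation, the factors named above can be written out as a single equation. This is a sketch transcribed from the prose description and from the general shape of the Lucene 1.x-era Similarity Javadoc; the exact form (for instance, whether "idf" appears once or squared) differs between Lucene versions, so treat the linked Javadoc as authoritative.

```latex
\mathrm{score}(q,d) \;=\;
  \mathrm{coord}(q,d)\,\cdot\,\mathrm{queryNorm}(q)\,\cdot
  \sum_{t \in q}
    \mathrm{tf}(t \in d)\,\cdot\,
    \mathrm{idf}(t)\,\cdot\,
    \mathrm{boost}(t \in q)\,\cdot\,
    \mathrm{boost}(t.\mathrm{field} \in d)\,\cdot\,
    \mathrm{lengthNorm}(t.\mathrm{field} \in d)
```

Here each t is a term qualified by its field, so summing over t also covers the per-field sum described in the text.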
  
  To interpret the Nutch "explain.jsp" page, keep the Lucene scoring equation 
cited above in mind. First, notice how the output moves right as we go from 
"score total", to "score per query term", to "score per query document field" 
(a document field is not shown if a term was not found in that field). Next, 
looking at the score for a particular field, it comprises a query component and 
a field component. The query component includes the query-time (as opposed to 
index-time) boost, an "idf" that is the same for the query and field 
components, and a "queryNorm". The field component is built similarly 
("fieldNorm" is an aggregation of certain of the Lucene equation components).
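
The breakdown described above can be mimicked with a small Python sketch. It is only an illustration of how the query component and the field component multiply together and then sum per field: the class and function names are hypothetical, the numbers are made up, and the grouping of factors follows the description in this FAQ entry rather than any Nutch or Lucene source code.

```python
from dataclasses import dataclass


@dataclass
class FieldMatch:
    """One (term, field) hit, named after the factors explain.jsp prints."""
    field: str
    tf: float            # term-frequency factor for this field
    idf: float           # same value in the query and field components
    field_norm: float    # "fieldNorm": aggregate of index-time factors
    boost: float = 1.0   # query-time boost for this field


def query_weight(m: FieldMatch, query_norm: float) -> float:
    # Query component: query-time boost * idf * queryNorm
    return m.boost * m.idf * query_norm


def field_weight(m: FieldMatch) -> float:
    # Field component: tf * idf * fieldNorm
    return m.tf * m.idf * m.field_norm


def score(matches, query_norm: float, coord: float = 1.0) -> float:
    # Sum the per-field products, then apply the coordination factor.
    total = sum(query_weight(m, query_norm) * field_weight(m) for m in matches)
    return coord * total


# Made-up values for a single query term found in two fields.
matches = [
    FieldMatch("title", tf=1.0, idf=2.0, field_norm=0.5, boost=1.5),
    FieldMatch("url", tf=1.0, idf=2.0, field_norm=0.25, boost=4.0),
]
total = score(matches, query_norm=0.1)
print(f"total score: {total:.4f}")
```

Because "idf" enters both components, it effectively counts twice in each per-field product, which is worth remembering when comparing the explain output against the equation.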
  
@@ -373, +357 @@

  
  Results as RSS (XML) rather than HTML are easier for programmatic clients to 
parse: such clients will query against 
[http://lucene.apache.org/nutch/apidocs/org/apache/nutch/searcher/OpenSearchServlet.html
 OpenSearchServlet] rather than search.jsp.  Results as XML can also be 
transformed using XSL stylesheets, the likely direction of UI development going 
forward.
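
As an illustration of what such a programmatic client might look like, here is a minimal Python sketch that builds a query URL and pulls titles and links out of an RSS response. The servlet path (/opensearch), the parameter names (query, start, hitsPerPage), the host, and the sample RSS document are all assumptions made for this example; check your own web.xml and the OpenSearchServlet Javadoc for the real values.

```python
import urllib.parse
import xml.etree.ElementTree as ET


def build_search_url(base, query, start=0, hits_per_page=10):
    """Build a query URL; the parameter names are assumed, not verified."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "hitsPerPage": hits_per_page}
    )
    return f"{base}?{params}"


def parse_rss_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS document."""
    root = ET.fromstring(rss_text)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iter("item")
    ]


# Invented sample response, only to exercise the parser offline.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Nutch: apache</title>
<item><title>Example page</title><link>http://example.com/</link></item>
<item><title>Another page</title><link>http://example.com/a</link></item>
</channel></rss>"""

# Hypothetical local deployment.
url = build_search_url("http://localhost:8080/opensearch", "apache lucene")
print(url)
for title, link in parse_rss_items(SAMPLE_RSS):
    print(title, link)
```

A real client would fetch the URL (for example with urllib.request.urlopen) and pass the response body to parse_rss_items instead of the sample string.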
  
- ==== How can I enable Porter Stemming in Nutch? ====
- Stemming in nutch can be implemented in the following ways. However, you will 
have to modify some internal files.
- Thanks to Howie Wang for developing the code for version 0.7.2. I updated his 
code for 0.8 and it should work without any issues.
- 
- 
[[http://wiki.apache.org/nutch/Stemming#head-3d04863d929228d17d0a8402a32c3900f7b13be7
 Stemming for Version 0.7.2]]
- 
- 
[[http://wiki.apache.org/nutch/Stemming#head-d314c23de10878dc34d76337632908d9f940e907
 Stemming for Version 0.8]]
- 
  === Crawling ===
  
  ==== Java.io.IOException: No input directories specified in: NutchConf: 
nutch-default.xml , mapred-default.xml ====
  
- The crawl tool expects as its first parameter the name of the folder where 
the seed urls file is located. For example, if your urls.txt is located in 
/nutch/seeds, the crawl command would look like: crawl seeds -dir 
/user/nutchuser...
+ The crawl tool expects as its first parameter the name of the folder where 
the seed urls file is located. For example, if your urls.txt is located in 
/nutch/seeds, the crawl command would look like: crawl seed -dir 
/user/nutchuser...
  
- === ReCrawling ===
- Here are scripts to help you with Intranet recrawling. 
- ==== Version 0.7.2 ====
- Place in your main Nutch directory.
- 
- 
[http://wiki.apache.org/nutch/IntranetRecrawl#head-76dc88d48baa3f429f58e9827ff2debba1f66098
 Recrawl-0.7.2]
- ==== Version 0.8.0 ====
- Place in the bin sub-directory of Nutch.
- 
- 
[http://wiki.apache.org/nutch/IntranetRecrawl#head-e58e25a0b9530bb6fcdfb282fd27a207fc0aff03
 Recrawl-0.8.0]
  === Discussion ===
  
  [http://grub.org/ Grub] has some interesting ideas about building a search 
engine using distributed computing. ''And how is that relevant to nutch?''

_______________________________________________
Nutch-cvs mailing list
Nutch-cvs@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-cvs
