Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "FAQ" page has been changed by LewisJohnMcgibbney: http://wiki.apache.org/nutch/FAQ?action=diff&rev1=127&rev2=128 Please visit our [[http://lucene.apache.org/nutch/bot.html|"webmaster info page"]] ==== Will Nutch be a distributed, P2P-based search engine? ==== - We don't think it is presently possible to build a peer-to-peer search engine that is competitive with existing search engines. It would just be too slow. Returning results in less than a second is important: it lets people rapidly reformulate their queries so that they can more often find what they're looking for. In short, a fast search engine is a better search engine. I don't think many people would want to use a search engine that takes ten or more seconds to return results. + We don't think it is presently possible to build a peer-to-peer search engine that is competitive with existing search engines. It would just be too slow. Returning results in less than a second is important: it lets people rapidly reformulate their queries so that they can more often find what they're looking for. In short, a fast search engine is a better search engine. We don't think many people would want to use a search engine that takes ten or more seconds to return results. That said, if someone wishes to start a sub-project of Nutch exploring distributed searching, we'd love to host it. We don't think these techniques are likely to solve the hard problems Nutch needs to solve, but we'd be happy to be proven wrong. @@ -27, +27 @@ ==== What Java version is required to run Nutch? ==== Nutch 0.7 will run with Java 1.4 and up. Nutch 1.0 with Java 6. - ==== Exception: java.net.SocketException: Invalid argument or cannot assign requested address on Fedora Core 3 or 4 ==== - It seems you have installed IPV6 on your machine. - - To solve this problem, add the following java param to the java instantiation in bin/nutch: - - JAVA_IPV4=-Djava.net.preferIPv4Stack=true - - # run it exec "$JAVA" $JAVA_HEAP_MAX $NUTCH_OPTS $JAVA_IPV4 -classpath "$CLASSPATH" $CLASS "$@" - ==== I have two XML files, nutch-default.xml and nutch-site.xml, why? ==== - nutch-default.xml is the out of the box configuration for nutch. Most configuration can (and should unless you know what your doing) stay as it is. nutch-site.xml is where you make the changes that override the default settings. The same goes to the servlet container application. + nutch-default.xml is the out of the box configuration for Nutch, and most configurations can (and should unless you know what your doing) stay as per. nutch-site.xml is where you make the changes that override the default settings. - ==== My system does not find the segments folder. Why? Or: How do I tell the ''Nutch Servlet'' where the index file are located? ==== - There are at least two choices to do that: - . First you need to copy the .WAR file to the servlet container webapps folder. - - {{{ - % cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war - }}} - . 1) After building your first index, start Tomcat from the index folder. - . Assuming your index is located at /index : - - {{{ - % cd /index/ - % $CATATALINA_HOME/bin/startup.sh - }}} - . '''Now you can search.''' - - . 2) After building your first index, start and stop Tomcat which will make Tomcat extrat the Nutch webapp. Than you need to edit the nutch-site.xml and put in it the location of the index folder. 
- ==== My system does not find the segments folder. Why? Or: How do I tell the ''Nutch Servlet'' where the index files are located? ====
- There are at least two choices to do that:
- . First you need to copy the .WAR file to the servlet container webapps folder.
-
- {{{
- % cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war
- }}}
- . 1) After building your first index, start Tomcat from the index folder.
- . Assuming your index is located at /index:
-
- {{{
- % cd /index/
- % $CATALINA_HOME/bin/startup.sh
- }}}
- . '''Now you can search.'''
-
- . 2) After building your first index, start and stop Tomcat, which will make Tomcat extract the Nutch webapp. Then edit nutch-site.xml and put in it the location of the index folder.
-
- {{{
- % $CATALINA_HOME/bin/startup.sh
- % $CATALINA_HOME/bin/shutdown.sh
- }}}
- {{{
- % vi $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml
-
- <?xml version="1.0"?>
- <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
-
- <nutch-conf>
-
- <property>
-   <name>searcher.dir</name>
-   <value>/your_index_folder_path</value>
- </property>
-
- </nutch-conf>
-
- % $CATALINA_HOME/bin/startup.sh
- }}}

=== Injecting ===
==== What happens if I inject urls several times? ====
URLs which are already in the database won't be injected.

=== Fetching ===
==== Is it possible to fetch only pages from some specific domains? ====
- Please have a look on PrefixURLFilter. Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively.
+ Please have a look at PrefixURLFilter. Adding some regular expressions to the regex-urlfilter.txt file might work, but adding a list with thousands of regular expressions would slow down your system excessively.

Alternatively, you can set db.ignore.external.links to "true", and inject seeds from the domains you wish to crawl (these seeds must link, directly or indirectly, to all pages you wish to crawl). Doing this will keep the crawl within these domains without following external links. Unfortunately there is no way to record the external links encountered for future processing, although a very small patch to the generator code can allow you to log these links to hadoop.log.
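As a sketch of the db.ignore.external.links approach, the override block to add to nutch-site.xml would look like the following (same override mechanism as shown earlier; the comment summarizes the intended behaviour):

{{{
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
  <!-- When true, outlinks leading to a different host are ignored,
       so the crawl stays within the domains of the injected seeds. -->
</property>
}}}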
@@ -92, +45 @@

Well, you cannot. However, you have two choices to proceed:
. 1) Recover the pages already fetched and then restart the fetcher.
- . You'll need to create a file fetcher.done in the segment directory and then: [[http://wiki.apache.org/nutch/bin/nutch_updatedb|updatedb]], [[http://wiki.apache.org/nutch/bin/nutch_generate|generate]] and [[http://wiki.apache.org/nutch/bin/nutch_fetch|fetch]] . Assuming your index is at /index
+ . You'll need to create a file fetcher.done in the segment directory and then run [[http://wiki.apache.org/nutch/bin/nutch_updatedb|updatedb]], [[http://wiki.apache.org/nutch/bin/nutch_generate|generate]] and [[http://wiki.apache.org/nutch/bin/nutch_fetch|fetch]]. Assuming your crawl data is at /crawl:

{{{
% touch /crawl/segments/2005somesegment/fetcher.done
- % bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
+ % bin/nutch updatedb /crawl/db/ /crawl/segments/2005somesegment/
- % bin/nutch generate /index/db/ /index/segments/2005somesegment/
+ % bin/nutch generate /crawl/db/ /crawl/segments/2005somesegment/
- % bin/nutch fetch /index/segments/2005somesegment
+ % bin/nutch fetch /crawl/segments/2005somesegment
}}}
. All the pages that were not crawled will be re-generated for fetch. If you fetched lots of pages and don't want to have to re-fetch them again, this is the best way.

@@ -121, +74 @@

* Or send the process a unix STOP signal. You should be able to index the already-fetched part of the segment, then later send a CONT signal to the process. Do not turn off your computer in between! :)

==== How many concurrent threads should I use? ====
+ This is dependent on your particular set-up; unless you understand your system and network environment, it is impossible to predict thread performance accurately. The Nutch default is an excellent starting point.
- This is dependent on your particular setup, but the following works for me:
-
- If you are using a slow internet connection (i.e. DSL), you might be limited to 40 or fewer concurrent fetches.
-
- If you have a fast internet connection (> 10Mb/sec) your bottleneck will definitely be the machine itself (in fact you will need multiple machines to saturate the data pipe). Empirically I have found that the machine works well with up to about 1000-1500 threads.
-
- To get this to work on my Linux box I needed to raise the open-file limit to 65535 (ulimit -n 65535), and I had to make sure that the DNS server could handle the load (we had to speak with our colo to get them to shut off an artificial cap on the DNS servers). Also, in order to get the speed up to a reasonable value, we needed to set the maximum fetches per host to 100 (otherwise we got a quick start followed by a very long slow tail of fetching).
-
- To other users: please add to this with your own experiences; my own experience may be atypical.

==== How can I force fetcher to use custom nutch-config? ====
* Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
* Copy these files from $NUTCH_HOME/conf to the new directory: common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, regex-normalize.xml, regex-urlfilter.txt
* Modify the nutch-default.xml to suit your needs
- * Set NUTCH_CONF_DIR environment variable to point into the directory you created
+ * Set the NUTCH_CONF_DIR environment variable in $NUTCH_HOME/bin/nutch to point to the directory you created
* Run $NUTCH_HOME/bin/nutch so that it picks up the NUTCH_CONF_DIR environment variable. Check the command output for the lines where the configs are loaded, to verify that they are really loaded from your custom dir.
* Happy using.

==== bin/nutch generate generates empty fetchlist, what can I do? ====
The reason for that is that when a page is fetched, it is timestamped in the webdb, so if its time is not up it will not be included in a fetchlist. For example, if you generate a fetchlist and then delete the segment directory that was created, calling generate again will generate an empty fetchlist. So, two choices (see the sketch after this list):
+ 1) Change your system date to be 30 days from today (if you haven't changed the default settings) and re-run bin/nutch generate
+ 2) Call bin/nutch generate with -adddays 30 (if you haven't changed the default settings) to make generate think the time has come. After generate you can call bin/nutch fetch.
- . 1) Change your system date to be 30 days from today (if you haven't changed the default settings) and re-run bin/nutch generate... 2) Call bin/nutch generate with the -adddays 30 (if you haven't changed the default settings) to make generate think the time has come... After generate you can call bin/nutch fetch.
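To make option 2 concrete, here is a minimal sketch reusing the /crawl layout from the fetcher-recovery example above; the exact command syntax varies between Nutch versions, and the segment name is hypothetical (use whatever directory generate just created):

{{{
% bin/nutch generate /crawl/db/ /crawl/segments/ -adddays 30
# fetch the newly generated segment (the name below is just an example)
% bin/nutch fetch /crawl/segments/20050101120000
}}}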
- ==== While fetching I get UnknownHostException for known hosts ====
- Make sure your DNS server is working and/or can handle the load of requests.

==== How can I fetch pages that require Authentication? ====
- See [[HttpAuthenticationSchemes]].
+ See the [[HttpAuthenticationSchemes]] wiki page.

=== Updating ===
==== Isn't there redundant/wasteful duplication between nutch crawldb and solr index? ====
