Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by MarcHammons:
http://wiki.apache.org/nutch/MarcHammons

------------------------------------------------------------------------------
+ My Nutch Odyssey - Getting Nutch 0.7.1 up and running
- My Nutch Oddessy
- 
- Getting Nutch 0.7.1 running...
  
  Tools
- apache-ant-1.6.5-bin.tar.gz
+  *apache-ant-1.6.5-bin.tar.gz
- j2sdk-1_4_2_09-linux-i586-rpm.bin
+  *j2sdk-1_4_2_09-linux-i586-rpm.bin
- jakarta-tomcat-4.1.31.tar.gz
+  *jakarta-tomcat-4.1.31.tar.gz
- nutch-0.7.1.tar.gz
+  *nutch-0.7.1.tar.gz
- poi-bin-2.5.1-final-20040804.tar.gz
+  *poi-bin-2.5.1-final-20040804.tar.gz
- poi-src-2.5.1-final-20040804.tar.gz
+  *poi-src-2.5.1-final-20040804.tar.gz
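+ Roughly how I laid the tools out (the paths, the JDK install location, and the 
NUTCH_JAVA_HOME variable name are from memory, so treat this as a sketch rather 
than a recipe):

{{{
# unpack the tarballs somewhere convenient, e.g. /opt
tar xzf apache-ant-1.6.5-bin.tar.gz -C /opt
tar xzf jakarta-tomcat-4.1.31.tar.gz -C /opt
tar xzf nutch-0.7.1.tar.gz -C /opt
# (the two POI tarballs are there for the MS Office parse plugins; see Plugins)

# point the build tools and Nutch at the JDK (j2sdk 1.4.2 in my case)
export JAVA_HOME=/usr/java/j2sdk1.4.2_09
export NUTCH_JAVA_HOME=$JAVA_HOME
export ANT_HOME=/opt/apache-ant-1.6.5
export PATH=$ANT_HOME/bin:$JAVA_HOME/bin:$PATH
}}}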
  
  Plugins
- http://issues.apache.org/jira/browse/NUTCH-52?page=all
+  *http://issues.apache.org/jira/browse/NUTCH-52?page=all
- http://issues.apache.org/jira/browse/NUTCH-21?page=all
+  *http://issues.apache.org/jira/browse/NUTCH-21?page=all
+ 
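+ Both of those JIRA issues carry attachments (patch/plugin code) that I pulled 
into my tree.  The attachment names below are placeholders and the patch level 
depends on how each patch was generated, but the mechanics were roughly:

{{{
cd /opt/nutch-0.7.1
# grab the attachment from the JIRA issue first, then:
patch -p0 < NUTCH-52.patch
patch -p0 < NUTCH-21.patch
# rebuild so the new/updated plugin code gets compiled in
ant
}}}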
+ So, getting Nutch to just crawl was straightforward.  I followed the basic 
tutorial like everyone else and had it crawling to a limited depth and breadth, 
and parsing pdf and doc files, in no time.  But...
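+ For reference, the one-shot crawl from the tutorial is just something along 
these lines (the seed file name, crawl directory, depth, and log redirect are 
example values of my own):

{{{
# urls.txt: a plain text file with one seed URL per line
bin/nutch crawl urls.txt -dir crawl.test -depth 3 >& crawl.log
}}}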
+ 
+ Then you need to scale up, and perhaps make a few changes, like adding HTTP 
basic authentication, which requires some reconfiguration and, no matter how 
you slice it, some tweaking of the source and a recompile or two (or 20 or 30 
if you're me) ;)
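+ The auth tweak itself is nothing exotic: HTTP basic authentication just means 
sending an Authorization header whose value is "Basic " plus base64(user:password), 
with the credentials pulled from the http.auth.basic.* properties described 
further down.  The class below only illustrates that idea, it is not my actual 
patch (the names are made up, and java.util.Base64 is a modern-JDK convenience 
rather than anything available on the 1.4 JDK I was using):

{{{
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustrative only: shows how the value of the "Authorization" request header
// is built from the configured username/password for HTTP basic authentication.
public class BasicAuthSketch {

    public static String authorizationHeader(String username, String password) {
        String credentials = username + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Prints: Authorization: Basic dXNlcjpzZWNyZXQ=
        System.out.println("Authorization: " + authorizationHeader("user", "secret"));
    }
}
}}}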
+ 
+ So here's what I did and a little why:
+ 
+ Configuration:
+ nutch-site.xml
+ 
+  *http.timeout - I set this to 100000 because I have to deal with access to a 
ClearCase-based document repository and that sucker can be sloooow.
+ 
+  *http.max.delays - I also set this to 100000 for the same reason.  There's 
only one host and it can be slow.
+ 
+  *fetcher.server.delay - I set this to 0.1.  Even though there's one host I 
don't want the fetcher threads sitting around all day before they start to 
fetch the next URL.  Setting this lower makes my fetches faster.
+ 
+  *fetcher.threads.fetch - I set this to 15.  There are 3 hosts that my crawl 
would access and I only wanted a max of 5 threads per host (see below).
+ 
+  *fetcher.threads.per.host - I set this to 5.
+ 
+  *parser.threads.parse - I set this to 15, in line with the 15/5 ratio that 
I've set up for my purposes.
+ 
+  *indexer.max.title.length - I set this to 1024.  We have document naming 
conventions that can produce some really long names.
+ 
+  *indexer.max.tokens - I set this to 10000.  Although we do have some really 
large documents, the little PC that I'm using to crawl doesn't have much memory 
(physical and virtual), so I'll have to live with any resulting truncation until 
I get more powerful HW.
+ 
+  *ftp.keep.connection - I set this to true.  Given that I'm only accessing a 
few hosts, and accessing them over and over, it doesn't make sense to close 
those connections, as that just creates more overhead when crawling.
+ 
+  *http.content.limit - I set this to -1.  This allows for full downloading of 
the file.  I am guessing that any further size restrictions are then imposed by 
indexer.max.tokens (without peeking into the source to verify this).
+ 
+  *plugin.includes - I updated the regex to include pdf|msword|powerpoint
+ 
+  *http.auth.basic.username - This is a bit special as it is part of my HTTP 
basic authentication hack.  The value of this would be your userid.
+ 
+  *http.auth.basic.password - Again part of the HTTP basic authentication 
hack.  The value of this would be your password.  I know, not secure, but it 
works for now.
+ 
+  *http.auth.verbose - I set this to true so that some additional debugging 
would be available in the logs.
+ 
+  *fetcher.verbose - I set this to true for debugging.
+ 
+  *http.verbose - I set this to true for debugging.
+ 
+  *parser.html.impl - I set this to tagsoup.  Some of the pages I was parsing 
did not adhere to strict HTML syntax, and consequently some of the URLs that I 
wanted included in the fetch were being discarded.  Using tagsoup let me catch 
those URLs and prune out the unwanted ones by making my regex sets a bit more 
specific.
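+ Pulling a representative subset of the above together, my nutch-site.xml ends 
up looking roughly like this (the root element name should match whatever your 
nutch-default.xml uses, the plugin.includes value needs to be built from your 
copy's default, and the two http.auth.basic.* properties only mean anything 
with the source tweak described earlier):

{{{
<?xml version="1.0"?>
<nutch-conf>
  <property>
    <name>http.timeout</name>
    <value>100000</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>15</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>5</value>
  </property>
  <property>
    <name>indexer.max.tokens</name>
    <value>10000</value>
  </property>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
  <!-- start from the plugin.includes default in nutch-default.xml and add the
       pdf/msword/powerpoint parse plugins to the regex -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf|msword|mspowerpoint)|index-basic|query-(basic|site|url)</value>
  </property>
  <!-- only meaningful with the patched HTTP protocol plugin described above -->
  <property>
    <name>http.auth.basic.username</name>
    <value>myuserid</value>
  </property>
  <property>
    <name>http.auth.basic.password</name>
    <value>mypassword</value>
  </property>
</nutch-conf>
}}}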
  
  
+ Compilation
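+ Short version, since the auth tweak means rebuilding from source: run ant from 
the Nutch root.  The target names below are from memory, so check build.xml for 
what your copy actually provides.

{{{
cd /opt/nutch-0.7.1
ant        # default target rebuilds the core (see build.xml for specifics)
ant war    # if a "war" target exists, rebuilds the web app for deployment to Tomcat
}}}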
+ 
