Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by MarcHammons: http://wiki.apache.org/nutch/MarcHammons

------------------------------------------------------------------------------

== My Nutch Odyssey - Getting Nutch 0.7.1 up and running ==
----
=== Tools ===
 *apache-ant-1.6.5-bin.tar.gz
 *j2sdk-1_4_2_09-linux-i586-rpm.bin
 *jakarta-tomcat-4.1.31.tar.gz
 *poi-bin-2.5.1-final-20040804.tar.gz
 *poi-src-2.5.1-final-20040804.tar.gz

=== Plugins ===
 *http://issues.apache.org/jira/browse/NUTCH-52?page=all
 *http://issues.apache.org/jira/browse/NUTCH-21?page=all
----
So, getting Nutch to just crawl was straightforward. I followed the basic tutorial like everyone else and actually had it crawling to a limited depth and breadth, and parsing pdf and doc files, in no time. But then you need to scale up, and perhaps need a few changes, like HTTP basic authentication, which requires some reconfiguration and, no matter how you slice it, also requires some tweaking of the source and a recompile or two (or 20 or 30 if you're me) ;)

So here's what I did, and a little of the why:

=== Configuration ===

nutch-site.xml

 *'''http.timeout''' - I set this to 100000 because I have to deal with access to a ClearCase-based document repository, and that sucker can be sloooow.
 *'''http.max.delays''' - I also set this to 100000 for the same reason. There's only one host and it can be slow.
 *'''fetcher.server.delay''' - I set this to 0.1. Even though there's one host, I don't want the fetcher threads sitting around all day before they start to fetch the next URL. Setting this lower makes my fetches faster.
 *'''fetcher.threads.fetch''' - I set this to 15. There are 3 hosts that my crawl would access and I only wanted a maximum of 5 threads per host (see below).
 *'''fetcher.threads.per.host''' - I set this to 5.
 *'''parser.threads.parse''' - I set this to 15, in line with the 15/5 ratio I've set up for my purposes.
 *'''indexer.max.title.length''' - I set this to 1024. We have document naming conventions that can produce some really long names.
 *'''indexer.max.tokens''' - I set this to 10000. Although we do have some really large documents, the little PC that I'm using to crawl doesn't have much memory (physical or virtual), so I'll have to live with any resulting truncation until I get more powerful hardware.
 *'''ftp.keep.connection''' - I set this to true. Given that I'm only accessing a few hosts, and accessing them over and over, it doesn't make sense to close those connections, as that just creates more overhead when crawling.
 *'''http.content.limit''' - I set this to -1. This allows the full file to be downloaded. I'm guessing that any further size restrictions are then imposed by indexer.max.tokens (I haven't peeked into the source to verify this).
 *'''plugin.includes''' - I updated the regex to include pdf|msword|powerpoint.
 *'''http.auth.basic.username''' - This is a bit special, as it is part of my HTTP basic authentication hack. The value of this would be your userid.
 *'''http.auth.basic.password''' - Again, part of the HTTP basic authentication hack. The value of this would be your password. I know, not secure, but it works for now.
 *'''http.auth.verbose''' - I set this to true so that some additional debugging would be available in the logs.
 *'''fetcher.verbose''' - I set this to true for debugging.
 *'''http.verbose''' - I set this to true for debugging.
 *'''parser.html.impl''' - I set this to tagsoup. Some of the pages I was parsing did not adhere to strict HTML syntax, and consequently some of the URLs that I wanted included in the fetch were being discarded. Using tagsoup let me catch those URLs and prune out others, as long as my regex sets were a bit more specific.
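
All of these end up as ordinary name/value property overrides in conf/nutch-site.xml. Here is a minimal sketch with a few of the values above; the http.auth.basic.* property only exists once my authentication hack is applied, and the root element should match whatever your conf/nutch-default.xml uses:

{{{
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of conf/nutch-default.xml -->
<nutch-conf>
  <property>
    <name>http.timeout</name>
    <value>100000</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>15</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>5</value>
  </property>
  <!-- custom property from the basic authentication hack; only takes
       effect with the patched http protocol plugin -->
  <property>
    <name>http.auth.basic.username</name>
    <value>your_userid_here</value>
  </property>
</nutch-conf>
}}}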
crawl-urlfilter.txt

Just recall that these rules are applied in top-down order, so if you want something discarded early it needs to go higher up in the file. My extension-pruning regular expression has gotten a little big at this point, but I really don't want this stuff in the mix:

 -\.(asm|bac|bak|bin|c|cat|cc|cdf|cdl|cfg|cgi|cpp|css|csv|dot|eps|exe|fm|gif|GIF|gz|h|ics|ico|ICO|iso|jar|java|jpg|JPG|l|lnt|mdl|mif|mov|MOV|mpg|mpp|msg|mso|mspat|mtdf|ndd|o|oft|orig|out|pjt|pl|pm|png|PNG|prc|prp|ps|rpm|rtf|sh|sit|st|tar|tb|tc|tgz|wmf|xla|xls|xml|y|Z|zip)$

Then I prune away based on host name (hosts changed to protect the innocent):

 -^http://hostthatidontwant.my.domain.name/.*

Then I prune away paths within the hosts that I do want, e.g. don't include any doxygen pages for source browsing:

 -.*/doxygen/.*

Then I put in the hosts that I do want, starting at the depth that I want in certain scenarios:

 +^http://hostthatIwant.my.domain.com/.*
 +^http://anotherhostiwant.my.domain.com/and/start/here/.*

Then I skip everything else:

 -.
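
Put together, the filter file reads top to bottom roughly like this (host names are still the placeholders from above and the extension list is abbreviated). Each line is a '+' include or '-' exclude rule, and the first pattern that matches a URL wins, which is why the ordering matters:

{{{
# crawl-urlfilter.txt - assembled from the fragments above

# 1. discard unwanted file extensions first
#    (abbreviated here; use the full extension list shown above)
-\.(asm|bak|exe|gif|jpg|png|rpm|tar|tgz|xml|zip)$

# 2. discard entire hosts I don't want
-^http://hostthatidontwant.my.domain.name/.*

# 3. discard unwanted paths on the hosts I do want
-.*/doxygen/.*

# 4. accept the hosts (or starting paths) I do want
+^http://hostthatIwant.my.domain.com/.*
+^http://anotherhostiwant.my.domain.com/and/start/here/.*

# 5. skip everything else
-.
}}}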
=== Compilation ===
