Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by MarcHammons: http://wiki.apache.org/nutch/Marc's_Nutch_0%2e7%2e1_Page

------------------------------------------------------------------------------

*http://issues.apache.org/jira/browse/NUTCH-52?page=all
*http://issues.apache.org/jira/browse/NUTCH-21?page=all

----

So, getting Nutch to just crawl was straightforward. I just followed the basic [http://lucene.apache.org/nutch/tutorial.html crawl tutorial] like everyone else and I actually had it crawling to a limited depth and breadth and parsing pdf and doc files in no time. But...

Then you need to scale up, want to add file types, and perhaps need a few changes, like HTTP basic authentication, which requires some reconfiguration and, no matter how you slice it, some tweaking of the source and a recompile or two (or 20 or 30 if you're me) ;)

=== Configuration: ===

''file: nutch-site.xml''

*'''http.timeout''' - I set this to 100000 because I have to deal with access to a clearcase based document repository and it can be sloooow.

*'''http.max.delays''' - I also set this to 100000 for the same reason. There's only one host and it can be slow.

*'''fetcher.server.delay''' - I set this to 0.1. Even though there's one host, I don't want the fetcher threads sitting around all day before they start to fetch the next URL. Setting this lower drops the latency between fetches, and over time those delays can add up.

*'''fetcher.threads.fetch''' - I set this to 15. There are 3 hosts that my crawl would access and I only wanted a max of 5 threads per host (see below). I'm not sure what kind of parallelism can be achieved with these threads on a single CPU host and I'm not willing to spend the time to investigate further. Let's just say I feel good with these at 15.

*'''fetcher.threads.per.host''' - I set this to 5.

*'''plugin.includes''' - I updated the regex to include pdf|msword|powerpoint.

*'''http.auth.basic.username''' - This is a bit special as it is part of my HTTP basic authentication hack. The value of this would be your userid. More on this below.

*'''http.auth.basic.password''' - Again part of the HTTP basic authentication hack. The value of this would be your password. All of you IS admins are cringing; I know, not secure, but it works.

*'''http.auth.verbose''' - I set this to true so that some additional debugging would be available in the logs.
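Pulled together, the overrides above would look roughly like the following in nutch-site.xml. This is only a sketch of the settings as described, not a copy of the actual file: the plugin.includes regex is an abbreviated guess (the exact value depends on which parse plugins you build, e.g. parse-pdf, parse-msword, parse-mspowerpoint), the username/password values are placeholders, and the nutch-conf root element is just what the 0.7-era config files use.

{{{
<?xml version="1.0"?>
<!-- Sketch of the nutch-site.xml overrides described above (Nutch 0.7.x). -->
<nutch-conf>

  <!-- Be patient with the slow clearcase-backed repository host. -->
  <property>
    <name>http.timeout</name>
    <value>100000</value>
  </property>
  <property>
    <name>http.max.delays</name>
    <value>100000</value>
  </property>

  <!-- Keep the fetcher threads busy instead of sleeping between fetches. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>0.1</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>15</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>5</value>
  </property>

  <!-- Abbreviated; your regex will differ with the plugins you have built. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf|msword|mspowerpoint)|index-basic|query-(basic|site|url)</value>
  </property>

  <!-- Properties added for the basic authentication hack described below. -->
  <property>
    <name>http.auth.basic.username</name>
    <value>YOUR_USERID</value>
  </property>
  <property>
    <name>http.auth.basic.password</name>
    <value>YOUR_PASSWORD</value>
  </property>
  <property>
    <name>http.auth.verbose</name>
    <value>true</value>
  </property>

</nutch-conf>
}}}

This file sits in the conf directory next to nutch-default.xml; anything set here overrides the defaults.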
''file: crawl-urlfilter.txt''

Just recall that the regexes are evaluated in top down order, so if you want something discarded early it needs to go higher up in the file. My extension pruning regular expression has gotten a little big at this point, but there is a ton of content there and I really don't want this stuff in the mix.

{{{
-\.(asm|bac|bak|bin|c|cat|cc|cdf|cdl|cfg|cgi|cpp|css|csv|dot|eps|exe|fm|gif|GIF|gz|h|ics|ico|ICO|iso|jar|java|jpg|JPG|l|lnt|mdl|mif|mov|MOV|mpg|mpp|msg|mso|mspat|mtdf|ndd|o|oft|orig|out|pjt|pl|pm|png|PNG|prc|prp|ps|rpm|rtf|sh|sit|st|tar|tb|tc|tgz|wmf|xla|xls|xml|y|Z|zip)$
}}}

Then I prune away based on host name.

{{{
-^http://hostthatidontwant.my.domain.name/.*
}}}

----
__''Hacking in Basic Authentication''__

Reading through the source one gets a sense of where the development of authentication might be headed, but I still needed a quick hack to get basic authentication up and going. There's something in place for NTLM, but nothing for plain basic authentication. So, following the existing NTLM implementation as a paradigm, and knowing that it will disappear eventually, I added the code below. I may get around to adding patch files to this page for these changes at some point.

''file: nutch-site.xml''

...

What you should immediately notice is that the AuthScope object allows ANY_HOST and ANY_PORT and applies the credentials uniformly across them. Given that I'm crawling an intranet I really don't have to concern myself with the user/pass changing, so I didn't add anything for flexibility there. You may want to do so should you need to supply different credentials for different hosts in your network.
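The code itself is elided above, so as a rough illustration, a basic authentication hack along these lines looks something like the sketch below, using the Jakarta Commons HttpClient API that the AuthScope remark refers to. The class and method names here are invented for illustration only; in the real change the call sits wherever the HTTP protocol plugin configures its shared HttpClient, and the username and password come from the http.auth.basic.* properties above.

{{{
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;

// Illustrative sketch only -- not the actual Nutch patch.
public class BasicAuthSketch {

  /**
   * Register basic-auth credentials on the shared HttpClient.
   * username/password would be read from http.auth.basic.username and
   * http.auth.basic.password in nutch-site.xml.
   */
  public static void addBasicCredentials(HttpClient client,
                                         String username, String password) {
    if (username == null || username.length() == 0) {
      return;  // nothing configured, leave the client untouched
    }
    // ANY_HOST/ANY_PORT means the same credentials are offered to every
    // server the fetcher talks to -- fine on a closed intranet, too broad
    // for a general crawl.
    AuthScope anyScope = new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT);
    client.getState().setCredentials(anyScope,
        new UsernamePasswordCredentials(username, password));
  }
}
}}}

If you do need different credentials for different hosts, you would register one AuthScope per host instead of the ANY_HOST/ANY_PORT catch-all.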
----
__''MS Powerpoint plugin failing''__

...

Next up for me is adding in the MS excel plugin. I haven't been using it simply because the spreadsheets that we use are of less significance than the documents themselves, and I've read a few comments to the effect that the excel plugin is working but not fully functional or correct 100% of the time.

Another interesting item might be to put some kind of feedback loop regarding system resources into the crawler. I find it very convenient to just use the crawl tool instead of having to write scripts to do everything. It would be nice to have the crawl tool factor in memory, such that should the tool begin to come close to some predefined ceiling it could dump a segment and begin another. Easier said than done I'm sure, but it would be handy.

I plan on scaling up my crawl to include a few different locations on my intranet as I become familiar with the tools that are available. At this point I'm ignorant of the full complement of features available to me, but then again my task is not as large as that of some of you out there with clusters indexing millions upon millions of pages. Nonetheless, my appreciation of what Nutch has given me compelled me to spend some time writing this up. Thanks Nutch team, and I hope this helps some of you out there.

Regards, Marc
