Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by MarcHammons: http://wiki.apache.org/nutch/Marc's_Nutch_0%2e7%2e1_Page

------------------------------------------------------------------------------

*http://issues.apache.org/jira/browse/NUTCH-52?page=all
*http://issues.apache.org/jira/browse/NUTCH-21?page=all

----

So, getting Nutch to just crawl was straightforward. I just followed the basic [http://lucene.apache.org/nutch/tutorial.html crawl tutorial] like everyone else and I actually had it crawling to a limited depth and breadth and parsing pdf and doc files in no time. But...

Then you need to scale up, want to add file types, and perhaps need a few changes, like HTTP basic authentication, which requires some reconfiguration and, no matter how you slice it, some tweaking of the source and a recompile or two (or 20 or 30 if you're me) ;)

=== Configuration: ===

''file: nutch-site.xml''

*'''http.timeout''' - I set this to 100000 because I have to deal with access to a clearcase based document repository and it can be sloooow.

*'''http.max.delays''' - I also set this to 100000 for the same reason. There's only one host and it can be slow.

*'''fetcher.server.delay''' - I set this to 0.1. Even though there's one host, I don't want the fetcher threads sitting around all day before they start to fetch the next URL. Setting this lower drops the latency between fetches, and over time those delays can add up.

*'''fetcher.threads.fetch''' - I set this to 15. There are 3 hosts that my crawl would access and I only wanted a max of 5 threads per host (see below). I'm not sure what kind of parallelism can be achieved with these threads on a single CPU host and I'm not willing to spend the time to investigate further. Let's just say I feel good with these at 15.

*'''fetcher.threads.per.host''' - I set this to 5.

*'''plugin.includes''' - I updated the regex to include pdf|msword|powerpoint.

*'''http.auth.basic.username''' - This is a bit special as it is part of my HTTP basic authentication hack. The value of this would be your userid. More on this below.

*'''http.auth.basic.password''' - Again part of the HTTP basic authentication hack. The value of this would be your password. All of you IS admins are cringing; I know, not secure, but it works.

*'''http.auth.verbose''' - I set this to true so that some additional debugging would be available in the logs.
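Pulled together, the overrides above would look roughly like the following in nutch-site.xml. This is only a sketch of the settings as described, not a copy of the actual file: the plugin.includes regex is an abbreviated guess (the exact value depends on which parse plugins you build, e.g. parse-pdf, parse-msword, parse-mspowerpoint), the username/password values are placeholders, and the nutch-conf root element is just what the 0.7-era config files use.

{{{
<?xml version="1.0"?>
<!-- Sketch of the nutch-site.xml overrides described above (Nutch 0.7.x). -->
<nutch-conf>

  <!-- Be patient with the slow clearcase-backed repository host. -->
  <property>
    <name>http.timeout</name>
    <value>100000</value>
  </property>
  <property>
    <name>http.max.delays</name>
    <value>100000</value>
  </property>

  <!-- Keep the fetcher threads busy instead of sleeping between fetches. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>0.1</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>15</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>5</value>
  </property>

  <!-- Abbreviated; your regex will differ with the plugins you have built. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(text|html|pdf|msword|mspowerpoint)|index-basic|query-(basic|site|url)</value>
  </property>

  <!-- Properties added for the basic authentication hack described below. -->
  <property>
    <name>http.auth.basic.username</name>
    <value>YOUR_USERID</value>
  </property>
  <property>
    <name>http.auth.basic.password</name>
    <value>YOUR_PASSWORD</value>
  </property>
  <property>
    <name>http.auth.verbose</name>
    <value>true</value>
  </property>

</nutch-conf>
}}}

This file sits in the conf directory next to nutch-default.xml; anything set here overrides the defaults.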
''file: crawl-urlfilter.txt''

Just recall that the regexes are evaluated in top down order, so if you want something discarded early it needs to go higher up in the file. My extension pruning regular expression has gotten a little big at this point, but there is a ton of content there and I really don't want this stuff in the mix.

{{{
-\.(asm|bac|bak|bin|c|cat|cc|cdf|cdl|cfg|cgi|cpp|css|csv|dot|eps|exe|fm|gif|GIF|gz|h|ics|ico|ICO|iso|jar|java|jpg|JPG|l|lnt|mdl|mif|mov|MOV|mpg|mpp|msg|mso|mspat|mtdf|ndd|o|oft|orig|out|pjt|pl|pm|png|PNG|prc|prp|ps|rpm|rtf|sh|sit|st|tar|tb|tc|tgz|wmf|xla|xls|xml|y|Z|zip)$
}}}

Then I prune away based on host name.

{{{
-^http://hostthatidontwant.my.domain.name/.*
}}}

----
__''Hacking in Basic Authentication''__

Reading through the source one gets a sense of where the development of authentication might be headed, but I still needed a quick hack to get basic authentication up and going. There's something in place for NTLM, but nothing for plain basic authentication. So, following the existing NTLM implementation as a paradigm, and knowing that it will disappear eventually, I added the code below. I may get around to adding patch files to this page for these changes at some point.

''file: nutch-site.xml''

...

What you should immediately notice is that the AuthScope object allows ANY_HOST and ANY_PORT and applies the credentials uniformly across them. Given that I'm crawling an intranet I really don't have to concern myself with the user/pass changing, so I didn't add anything for flexibility there. You may want to do so should you need to supply different credentials for different hosts in your network.
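The code itself is elided above, so as a rough illustration, a basic authentication hack along these lines looks something like the sketch below, using the Jakarta Commons HttpClient API that the AuthScope remark refers to. The class and method names here are invented for illustration only; in the real change the call sits wherever the HTTP protocol plugin configures its shared HttpClient, and the username and password come from the http.auth.basic.* properties above.

{{{
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.UsernamePasswordCredentials;
import org.apache.commons.httpclient.auth.AuthScope;

// Illustrative sketch only -- not the actual Nutch patch.
public class BasicAuthSketch {

  /**
   * Register basic-auth credentials on the shared HttpClient.
   * username/password would be read from http.auth.basic.username and
   * http.auth.basic.password in nutch-site.xml.
   */
  public static void addBasicCredentials(HttpClient client,
                                         String username, String password) {
    if (username == null || username.length() == 0) {
      return;  // nothing configured, leave the client untouched
    }
    // ANY_HOST/ANY_PORT means the same credentials are offered to every
    // server the fetcher talks to -- fine on a closed intranet, too broad
    // for a general crawl.
    AuthScope anyScope = new AuthScope(AuthScope.ANY_HOST, AuthScope.ANY_PORT);
    client.getState().setCredentials(anyScope,
        new UsernamePasswordCredentials(username, password));
  }
}
}}}

If you do need different credentials for different hosts, you would register one AuthScope per host instead of the ANY_HOST/ANY_PORT catch-all.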
----
__''MS Powerpoint plugin failing''__

...

Next up for me is adding in the MS excel plugin. I haven't been using it simply because the spreadsheets that we use are of less significance than the documents themselves, and I've read a few comments to the effect that the excel plugin is working but not fully functional or correct 100% of the time.

Another interesting item might be to put some kind of feedback loop regarding system resources into the crawler. I find it very convenient to just use the crawl tool instead of having to write scripts to do everything. It would be nice to have the crawl tool factor in memory, such that should the tool begin to come close to some predefined ceiling it could dump a segment and begin another. Easier said than done I'm sure, but it would be handy.

I plan on scaling up my crawl to include a few different locations on my intranet as I become familiar with the tools that are available. At this point I'm ignorant of the full complement of features available to me, but then again my task is not as large as that of some of you out there with clusters indexing millions upon millions of pages. Nonetheless, my appreciation of what Nutch has given me compelled me to spend some time writing this up. Thanks Nutch team, and I hope this helps some of you out there.

Regards, Marc
