Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by MarcHammons: http://wiki.apache.org/nutch/MarcHammons

------------------------------------------------------------------------------

== My Nutch Odyssey - Getting Nutch 0.7.1 up and running ==
----
=== Tools ===
 *apache-ant-1.6.5-bin.tar.gz
 *j2sdk-1_4_2_09-linux-i586-rpm.bin
 *jakarta-tomcat-4.1.31.tar.gz
 *poi-bin-2.5.1-final-20040804.tar.gz
 *poi-src-2.5.1-final-20040804.tar.gz

=== Plugins ===
 *http://issues.apache.org/jira/browse/NUTCH-52?page=all
 *http://issues.apache.org/jira/browse/NUTCH-21?page=all
----
So, getting Nutch to just crawl was straightforward. I followed the basic tutorial like everyone else and actually had it crawling to a limited depth and breadth, and parsing pdf and doc files, in no time. But then you need to scale up, and perhaps need a few changes, like HTTP basic authentication, which requires some reconfiguration and, no matter how you slice it, also requires some tweaking of the source and a recompile or two (or 20 or 30 if you're me) ;)

So here's what I did, and a little of the why:

=== Configuration ===

nutch-site.xml

 *'''http.timeout''' - I set this to 100000 because I have to deal with access to a ClearCase-based document repository, and that sucker can be sloooow.
 *'''http.max.delays''' - I also set this to 100000 for the same reason. There's only one host and it can be slow.
 *'''fetcher.server.delay''' - I set this to 0.1. Even though there's one host, I don't want the fetcher threads sitting around all day before they start to fetch the next URL. Setting this lower makes my fetches faster.
 *'''fetcher.threads.fetch''' - I set this to 15. There are 3 hosts that my crawl would access and I only wanted a maximum of 5 threads per host (see below).
 *'''fetcher.threads.per.host''' - I set this to 5.
 *'''parser.threads.parse''' - I set this to 15, in line with the 15/5 ratio I've set up for my purposes.
 *'''indexer.max.title.length''' - I set this to 1024. We have document naming conventions that can produce some really long names.
 *'''indexer.max.tokens''' - I set this to 10000. Although we do have some really large documents, the little PC that I'm using to crawl doesn't have much memory (physical or virtual), so I'll have to live with any resulting truncation until I get more powerful hardware.
 *'''ftp.keep.connection''' - I set this to true. Given that I'm only accessing a few hosts, and accessing them over and over, it doesn't make sense to close those connections, as that just creates more overhead when crawling.
 *'''http.content.limit''' - I set this to -1. This allows the full file to be downloaded. I'm guessing that any further size restrictions are then imposed by indexer.max.tokens (I haven't peeked into the source to verify this).
 *'''plugin.includes''' - I updated the regex to include pdf|msword|powerpoint.
 *'''http.auth.basic.username''' - This is a bit special, as it is part of my HTTP basic authentication hack. The value of this would be your userid.
 *'''http.auth.basic.password''' - Again, part of the HTTP basic authentication hack. The value of this would be your password. I know, not secure, but it works for now.
 *'''http.auth.verbose''' - I set this to true so that some additional debugging would be available in the logs.
 *'''fetcher.verbose''' - I set this to true for debugging.
 *'''http.verbose''' - I set this to true for debugging.
 *'''parser.html.impl''' - I set this to tagsoup. Some of the pages I was parsing did not adhere to strict HTML syntax, and consequently some of the URLs that I wanted included in the fetch were being discarded. Using tagsoup let me catch those URLs and prune out others, as long as my regex sets were a bit more specific.
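
All of these end up as ordinary name/value property overrides in conf/nutch-site.xml. Here is a minimal sketch with a few of the values above; the http.auth.basic.* property only exists once my authentication hack is applied, and the root element should match whatever your conf/nutch-default.xml uses:

{{{
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: site-specific overrides of conf/nutch-default.xml -->
<nutch-conf>
  <property>
    <name>http.timeout</name>
    <value>100000</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <value>15</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>5</value>
  </property>
  <!-- custom property from the basic authentication hack; only takes
       effect with the patched http protocol plugin -->
  <property>
    <name>http.auth.basic.username</name>
    <value>your_userid_here</value>
  </property>
</nutch-conf>
}}}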
crawl-urlfilter.txt

Just recall that these rules are applied in top-down order, so if you want something discarded early it needs to go higher up in the file. My extension-pruning regular expression has gotten a little big at this point, but I really don't want this stuff in the mix:

 -\.(asm|bac|bak|bin|c|cat|cc|cdf|cdl|cfg|cgi|cpp|css|csv|dot|eps|exe|fm|gif|GIF|gz|h|ics|ico|ICO|iso|jar|java|jpg|JPG|l|lnt|mdl|mif|mov|MOV|mpg|mpp|msg|mso|mspat|mtdf|ndd|o|oft|orig|out|pjt|pl|pm|png|PNG|prc|prp|ps|rpm|rtf|sh|sit|st|tar|tb|tc|tgz|wmf|xla|xls|xml|y|Z|zip)$

Then I prune away based on host name (hosts changed to protect the innocent):

 -^http://hostthatidontwant.my.domain.name/.*

Then I prune away paths within the hosts that I do want, e.g. don't include any doxygen pages for source browsing:

 -.*/doxygen/.*

Then I put in the hosts that I do want, starting at the depth that I want in certain scenarios:

 +^http://hostthatIwant.my.domain.com/.*
 +^http://anotherhostiwant.my.domain.com/and/start/here/.*

Then I skip everything else:

 -.
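
Put together, the filter file reads top to bottom roughly like this (host names are still the placeholders from above and the extension list is abbreviated). Each line is a '+' include or '-' exclude rule, and the first pattern that matches a URL wins, which is why the ordering matters:

{{{
# crawl-urlfilter.txt - assembled from the fragments above

# 1. discard unwanted file extensions first
#    (abbreviated here; use the full extension list shown above)
-\.(asm|bak|exe|gif|jpg|png|rpm|tar|tgz|xml|zip)$

# 2. discard entire hosts I don't want
-^http://hostthatidontwant.my.domain.name/.*

# 3. discard unwanted paths on the hosts I do want
-.*/doxygen/.*

# 4. accept the hosts (or starting paths) I do want
+^http://hostthatIwant.my.domain.com/.*
+^http://anotherhostiwant.my.domain.com/and/start/here/.*

# 5. skip everything else
-.
}}}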
=== Compilation ===
