Stefan,

Thanks for the info. I want to limit the crawl to our new site. I have changed the line

  +^http://([a-z0-9]*\.)*apache.org/

to

  +^http://([a-z0-9]*\.)*woodward.edu/

but the crawl reaches one server, a calendar server, and gets stuck there in a never-ending crawl. I want to limit the crawl to one server and that's all: I need to crawl only the main "www" site, which is www.woodward.edu, and not everything under *.woodward.edu. We have too many servers that I don't need to get info for. Any suggestions on this?

andy

-----Original Message-----
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Monday, February 06, 2006 6:58 PM
To: [email protected]
Subject: Re: How deep to go
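A minimal crawl-urlfilter.txt sketch for that restriction, as a rough illustration only, assuming the stock filter file layout where the first matching pattern wins and the file ends in a catch-all rule:

  # accept only the main www host
  +^http://www\.woodward\.edu/

  # skip everything else, including the other *.woodward.edu hosts
  # (the calendar server among them, whatever its hostname is)
  -.

With the accept line narrowed to www and the final catch-all left as "-." rather than "+.", the calendar host never matches an accept rule, so the never-ending crawl of it should stop. The stock file's rule that skips query-style URLs (typically a -[?*!@=] line) also helps against calendar-style URL loops on the www host itself.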
Instead of using the crawl command, I personally prefer the manual commands. I use a small script that runs the steps from http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling in a never-ending loop, waiting a day between iterations. This makes sure that you pick up all links that match your URL filter. Just don't forget to remove old segments and merge the indexes together; more about such things can be found in the mail archive. Also don't forget to add the plugins (e.g. the PDF parser).

HTH,
Stefan

On 05.02.2006 at 19:54, Andy Morris wrote:

> How deep should a good intranet crawl be... 10-20?
> I still can't get all of my site searchable.
>
> Here is my situation: I want to crawl just a local site for our
> intranet. We have just rolled out an ASP-only website from a pure
> HTML site. I ran Nutch on the old site and got great results. Since
> moving to this new site I am having a devil of a time retrieving
> good information and am missing a ton of info altogether. I am not
> sure what settings I need to change to get good results. One setting
> I changed does produce good results, but it seems to crawl other
> websites and not just my domain: in the last line of the
> crawl-urlfilter file I just replaced the - with + so it does not
> ignore other information. Our site is www.woodward.edu; I was
> wondering if someone on this list can crawl this site, and only this
> domain, and see what they come up with. Woodward.edu is the domain.
> I am just stumped as to what to do next. I am running a nightly
> build from January 26th, 2006.
>
> My criteria for our local search are to be able to search PDF,
> images, doc, and web content. You can go here and see what the
> search page pulls up: http://search.woodward.edu .
>
> Thanks for any help this list can provide.
> Andy Morris

---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
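A rough sketch of the kind of loop Stefan describes above, assuming the 0.8-dev style bin/nutch sub-commands from the whole-web tutorial (inject, generate, fetch, updatedb, invertlinks, index); the directory names and exact argument order are assumptions, so check bin/nutch's usage output against your own build before running anything like this:

  #!/bin/sh
  # Never-ending whole-web crawl loop, one iteration per day.
  # Assumes the crawl data lives under ./crawl and seed URLs under ./urls.

  # one-time bootstrap (run once before the loop):
  #   bin/nutch inject crawl/crawldb urls

  while true
  do
    # generate the next fetch list and pick up the newest segment
    bin/nutch generate crawl/crawldb crawl/segments
    segment=`ls -d crawl/segments/* | tail -1`

    # fetch the segment, then fold the new links back into the crawldb
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment

    # rebuild the link database and index the new segment
    bin/nutch invertlinks crawl/linkdb crawl/segments
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $segment

    # this is also the place to prune old segments and merge indexes,
    # as Stefan mentions; then wait a day before the next round
    sleep 86400
  done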
