Stefan,

Thanks for the info. I want to limit the crawl to our new site. I have changed the line

  +^http://([a-z0-9]*\.)*apache.org/

to

  +^http://([a-z0-9]*\.)*woodward.edu/

but the crawl reaches one server, a calendar server, and gets stuck there in a never-ending crawl. I want to limit the crawl to one server and that's all: I need to crawl only the main "www" site, which is www.woodward.edu, and not everything under *.woodward.edu. We have too many servers that I don't need to get info for. Any suggestions on this?

andy

-----Original Message-----
From: Stefan Groschupf [mailto:[EMAIL PROTECTED]
Sent: Monday, February 06, 2006 6:58 PM
To: [email protected]
Subject: Re: How deep to go
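A minimal crawl-urlfilter.txt sketch for that restriction, as a rough illustration only, assuming the stock filter file layout where the first matching pattern wins and the file ends in a catch-all rule:

  # accept only the main www host
  +^http://www\.woodward\.edu/

  # skip everything else, including the other *.woodward.edu hosts
  # (the calendar server among them, whatever its hostname is)
  -.

With the accept line narrowed to www and the final catch-all left as "-." rather than "+.", the calendar host never matches an accept rule, so the never-ending crawl of it should stop. The stock file's rule that skips query-style URLs (typically a -[?*!@=] line) also helps against calendar-style URL loops on the www host itself.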
Instead of using the crawl command, I personally prefer the manual commands. I use a small script that runs the steps from http://lucene.apache.org/nutch/tutorial.html#Whole-web+Crawling in a never-ending loop, waiting a day between iterations. This makes sure that you pick up all links that match your URL filter. Just don't forget to remove old segments and merge the indexes together; more about such things can be found in the mail archive. Also don't forget to add the plugins (e.g. the PDF parser).

HTH,
Stefan

On 05.02.2006 at 19:54, Andy Morris wrote:

> How deep should a good intranet crawl be... 10-20?
> I still can't get all of my site searchable.
>
> Here is my situation: I want to crawl just a local site for our
> intranet. We have just rolled out an ASP-only website from a pure
> HTML site. I ran Nutch on the old site and got great results. Since
> moving to this new site I am having a devil of a time retrieving
> good information and am missing a ton of info altogether. I am not
> sure what settings I need to change to get good results. One setting
> I changed does produce good results, but it seems to crawl other
> websites and not just my domain: in the last line of the
> crawl-urlfilter file I just replaced the - with + so it does not
> ignore other information. Our site is www.woodward.edu; I was
> wondering if someone on this list can crawl this site, and only this
> domain, and see what they come up with. Woodward.edu is the domain.
> I am just stumped as to what to do next. I am running a nightly
> build from January 26th, 2006.
>
> My criteria for our local search are to be able to search PDF,
> images, doc, and web content. You can go here and see what the
> search page pulls up: http://search.woodward.edu .
>
> Thanks for any help this list can provide.
> Andy Morris

---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net
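A rough sketch of the kind of loop Stefan describes above, assuming the 0.8-dev style bin/nutch sub-commands from the whole-web tutorial (inject, generate, fetch, updatedb, invertlinks, index); the directory names and exact argument order are assumptions, so check bin/nutch's usage output against your own build before running anything like this:

  #!/bin/sh
  # Never-ending whole-web crawl loop, one iteration per day.
  # Assumes the crawl data lives under ./crawl and seed URLs under ./urls.

  # one-time bootstrap (run once before the loop):
  #   bin/nutch inject crawl/crawldb urls

  while true
  do
    # generate the next fetch list and pick up the newest segment
    bin/nutch generate crawl/crawldb crawl/segments
    segment=`ls -d crawl/segments/* | tail -1`

    # fetch the segment, then fold the new links back into the crawldb
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment

    # rebuild the link database and index the new segment
    bin/nutch invertlinks crawl/linkdb crawl/segments
    bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $segment

    # this is also the place to prune old segments and merge indexes,
    # as Stefan mentions; then wait a day before the next round
    sleep 86400
  done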
