[Nutch-general] Auto-crawling & re-crawling the web site

bob knob Tue, 11 Apr 2006 07:06:26 -0700

Hi,

I am currently evaluating Nutch for use on an intranet
site search engine. I am by no means an expert in this
field although I am trying to learn more about it.


1. I was reading one of the articles referenced on the
Nutch site:

http://today.java.net/pub/a/today/2006/02/16/introduction-to-nutch-2.html

and I was a little concerned about its warning
about "re-crawling" the site. I understand that
there are several steps (crawling, building the
index, etc.), but it sounded to me as though new pages
on my web site would be ignored until I restarted the
Nutch server, even after I've re-crawled. Is that
correct? How do most people deal with it?

2. It seems like I would want to re-crawl or re-index
the site on a nightly basis. All of this seems to be
done with shell scripts, and I wonder what options are
available to someone working on a Windows platform. I
could run cygrunsrv/cron on Windows, I guess. Is there
some reason more of this scripting couldn't be redone
as a Java program? Also, has anybody considered
creating a Windows service to manage indexing/crawling,
like the one that manages the Tomcat web server?
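
To make the Java question concrete, here is a rough sketch of the
kind of scheduler I have in mind. Everything in it is my own
placeholder rather than anything Nutch ships: the class name
NightlyRecrawl, the 02:00 run time, and the idea of passing the
re-crawl command on the command line (the program just runs whatever
it is given and does not assume any particular Nutch options). It
assumes a current JDK with java.time and lambdas.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical helper: runs a user-supplied command once a night.
public class NightlyRecrawl {

    // Milliseconds from 'now' until the next occurrence of 'runAt'.
    static long millisUntil(LocalDateTime now, LocalTime runAt) {
        LocalDateTime next = now.toLocalDate().atTime(runAt);
        if (!next.isAfter(now)) {
            next = next.plusDays(1); // today's slot has passed, use tomorrow's
        }
        return Duration.between(now, next).toMillis();
    }

    // Runs the external command, sharing this process's console,
    // and returns its exit code.
    static int runCommand(String... command) throws Exception {
        Process p = new ProcessBuilder(command).inheritIO().start();
        return p.waitFor();
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: java NightlyRecrawl <command> [args...]");
            return;
        }
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        // 02:00 is an arbitrary choice for "nightly".
        long initialDelay = millisUntil(LocalDateTime.now(), LocalTime.of(2, 0));
        long oneDay = TimeUnit.DAYS.toMillis(1);
        scheduler.scheduleAtFixedRate(() -> {
            try {
                int exit = runCommand(args);
                System.out.println("re-crawl command exited with " + exit);
            } catch (Exception e) {
                e.printStackTrace(); // swallow so later runs are not cancelled
            }
        }, initialDelay, oneDay, TimeUnit.MILLISECONDS);
    }
}
```

The fixed 24-hour period will drift by an hour across daylight-saving
changes, and the sketch does nothing about restarting the search
server afterwards, which is really what question 1 is about.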

Thanks,
Bob

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
