But are there any rules by which we can decide when to recrawl a website, so that we get its updated contents as soon as possible?

Otis Gospodnetic-2 wrote:
>
> Neeti,
>
> I don't think there is a way to know when a regular web site has been
> updated. You can issue GET or HEAD requests and look at the Last-Modified
> date, but this is not 100% reliable. You can fetch and compare content,
> but that's not 100% reliable either. If you are indexing blogs, then you
> can get "pings" when they update, or can rely on detecting changes in
> their feeds.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
>> From: Neeti Gupta <[email protected]>
>> To: [email protected]
>> Sent: Wednesday, June 24, 2009 7:52:47 AM
>> Subject: recrawling
>>
>> we had made a crawler that visit various sites, and i want the crawler to
>> crawl sites as soon as they are updated, if anyone can help me to know how
>> i can know when the site is updated and its the time to crawl again
>> --
>> View this message in context:
>> http://www.nabble.com/recrawling-tp24183356p24183356.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>
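
For what it's worth, the two checks Otis mentions (asking the server via Last-Modified, and comparing fetched content) could be sketched roughly as below. This is only an illustration using the Python standard library, not anything Nutch does internally. The names `PageState`, `conditional_headers`, `body_changed`, `check_page` and `next_interval` are made up for this example, the ETag check is an addition that goes alongside Last-Modified in HTTP, and the "halve the wait on change, back off otherwise" rule with its one-hour and thirty-day limits is just an assumed heuristic for picking the next visit time.

```python
import hashlib
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageState:
    """What the crawler remembers about a page from its previous visit."""
    etag: Optional[str] = None
    last_modified: Optional[str] = None
    content_hash: Optional[str] = None


def conditional_headers(state: PageState) -> dict:
    """Build validators so the server can answer 304 Not Modified."""
    headers = {}
    if state.etag:
        headers["If-None-Match"] = state.etag
    if state.last_modified:
        headers["If-Modified-Since"] = state.last_modified
    return headers


def body_changed(state: PageState, body: bytes) -> bool:
    """Fallback check: compare a hash of the body with the stored hash."""
    return hashlib.sha256(body).hexdigest() != state.content_hash


def check_page(url: str, state: PageState, timeout: float = 10.0):
    """Return (changed, new_state) for one page (performs a network request)."""
    request = urllib.request.Request(url, headers=conditional_headers(state))
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
            new_state = PageState(
                etag=response.headers.get("ETag"),
                last_modified=response.headers.get("Last-Modified"),
                content_hash=hashlib.sha256(body).hexdigest(),
            )
            # Servers may ignore the validators, so still compare content.
            return body_changed(state, body), new_state
    except urllib.error.HTTPError as err:
        if err.code == 304:  # server reports the page is unchanged
            return False, state
        raise


def next_interval(current: float, changed: bool,
                  shortest: float = 3600.0,
                  longest: float = 30 * 86400.0) -> float:
    """Assumed heuristic: halve the wait after a change, else back off 1.5x."""
    proposed = current / 2 if changed else current * 1.5
    return max(shortest, min(longest, proposed))
```

A crawler would call `check_page` at each visit, store the returned state, and feed the `changed` flag into `next_interval` to schedule the next visit. As Otis says, neither signal is 100% reliable: servers can omit or misreport Last-Modified, and pages with ads, timestamps or counters will produce a different hash on every fetch even when the real content has not changed.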
