You have to detect changes in the web content itself.

http://zipclue.com
--- On Tue, 7/14/09, Neeti Gupta <[email protected]> wrote:

> From: Neeti Gupta <[email protected]>
> Subject: Re: recrawling
> To: [email protected]
> Date: Tuesday, July 14, 2009, 6:50 AM
>
> But are there any rules by which we can define when to crawl a website to get
> its updated contents as soon as possible.
>
> Otis Gospodnetic-2 wrote:
> >
> > Neeti,
> >
> > I don't think there is a way to know when a regular web site has been
> > updated. You can issue GET or HEAD requests and look at the Last-Modified
> > date, but this is not 100% reliable. You can fetch and compare content,
> > but that's not 100% reliable either. If you are indexing blogs, then you
> > can get "pings" when they update, or can rely on detecting changes in
> > their feeds.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> >> From: Neeti Gupta <[email protected]>
> >> To: [email protected]
> >> Sent: Wednesday, June 24, 2009 7:52:47 AM
> >> Subject: recrawling
> >>
> >> we had made a crawler that visit various sites, and i want the crawler to
> >> crawl sites as soon as they are updated, if anyone can help me to know how i
> >> can know when the site is updated and its the time to crawl again
> >> --
> >> View this message in context:
> >> http://www.nabble.com/recrawling-tp24183356p24183356.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >
> --
> View this message in context:
> http://www.nabble.com/recrawling-tp24183356p24474563.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
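
For what it's worth, the two checks Otis mentions (looking at Last-Modified, and fetching and comparing content) can be combined. Below is a rough sketch in plain Python, not Nutch code. The names PageState, fingerprint and check_for_update are made up for illustration, and it assumes the server honours If-Modified-Since, which, as noted in the thread, many servers do not do reliably. The hash comparison is the fallback for those cases, and it will still report a change for pages that embed timestamps, ads, or other per-request content.

```python
import hashlib
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class PageState:
    """What we remember about a page from the previous crawl (hypothetical structure)."""
    last_modified: Optional[str] = None  # raw Last-Modified header value, if any
    content_hash: Optional[str] = None   # SHA-256 of the body we fetched last time


def fingerprint(body: bytes) -> str:
    """Hash the raw body so two fetches can be compared cheaply."""
    return hashlib.sha256(body).hexdigest()


def check_for_update(url: str, prev: PageState,
                     timeout: float = 10.0) -> Tuple[bool, PageState]:
    """Return (changed, new_state) using a conditional GET plus a hash comparison."""
    req = urllib.request.Request(url)
    if prev.last_modified:
        # Ask the server to answer 304 Not Modified if nothing changed.
        req.add_header("If-Modified-Since", prev.last_modified)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            body = resp.read()
            new_state = PageState(
                last_modified=resp.headers.get("Last-Modified"),
                content_hash=fingerprint(body),
            )
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # urllib reports 304 as an HTTPError; it means "unchanged".
            return False, prev
        raise
    # The server sent a full body; only the hash tells us whether it differs.
    return new_state.content_hash != prev.content_hash, new_state
```

A crawler would store the returned PageState per URL and pass it back on the next visit. How often to call it is a separate scheduling question that this sketch does not answer.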
