Re: recrawling

Sjaiful Bahri Tue, 14 Jul 2009 00:31:02 -0700

You have to detect changes in the web content yourself.
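
A minimal sketch of what that could look like, combining the checks mentioned further down the thread (Last-Modified, content comparison) with a simple rule for when to come back. None of this is Nutch code: the record keys ("etag", "last_modified", "sha256"), the function names, and the halve/double rule with its default bounds are assumptions made up for illustration. As noted below, neither the headers nor the content comparison is 100% reliable, so treat the result as a hint.

```python
import hashlib


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 hex digest of the raw page body."""
    return hashlib.sha256(body).hexdigest()


def page_changed(prev: dict, headers: dict, body: bytes) -> bool:
    """Guess whether a page changed since the previous fetch.

    prev    -- record saved from the last fetch (hypothetical keys:
               "etag", "last_modified", "sha256")
    headers -- response headers of the new fetch, assumed to be keyed
               exactly as "ETag" and "Last-Modified"
    body    -- raw body of the new fetch
    """
    # A validator that is present on both fetches and differs is a
    # cheap sign of change.
    for key, header in (("etag", "ETag"), ("last_modified", "Last-Modified")):
        old, new = prev.get(key), headers.get(header)
        if old and new and old != new:
            return True
    # Otherwise fall back to comparing content. Pages with embedded
    # timestamps or ads will look "changed" on every fetch.
    return prev.get("sha256") != fingerprint(body)


def next_interval(current: float, changed: bool,
                  shortest: float = 3600.0,
                  longest: float = 30 * 86400.0) -> float:
    """Return the next recrawl interval in seconds.

    Assumed rule: halve the interval when the page changed, double it
    when it did not, and clamp the result to [shortest, longest].
    """
    proposed = current / 2 if changed else current * 2
    return max(shortest, min(longest, proposed))
```

With something like this, a page that keeps changing gets visited more and more often (down to once an hour here), and a page that never changes drifts out to once a month.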

http://zipclue.com



--- On Tue, 7/14/09, Neeti Gupta <[email protected]> wrote:

> From: Neeti Gupta <[email protected]>
> Subject: Re: recrawling
> To: [email protected]
> Date: Tuesday, July 14, 2009, 6:50 AM
> 
> But are there any rules by which we can define when to crawl a website,
> so that we get its updated contents as soon as possible?
> 
> 
> 
> Otis Gospodnetic-2 wrote:
> > 
> > 
> > Neeti,
> > 
> > I don't think there is a way to know when a regular web site has been
> > updated.  You can issue GET or HEAD requests and look at the Last-Modified
> > date, but this is not 100% reliable.  You can fetch and compare content,
> > but that's not 100% reliable either.  If you are indexing blogs, then you
> > can get "pings" when they update, or can rely on detecting changes in
> > their feeds.
> > 
> >  Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
> > 
> > 
> > ----- Original Message ----
> >> From: Neeti Gupta <[email protected]>
> >> To: [email protected]
> >> Sent: Wednesday, June 24, 2009 7:52:47 AM
> >> Subject: recrawling
> >> 
> >> 
> >> We have built a crawler that visits various sites, and I want it to
> >> recrawl each site as soon as it is updated. Can anyone help me find out
> >> when a site has been updated and it is time to crawl it again?
> >> -- 
> >> View this message in context: 
> >> http://www.nabble.com/recrawling-tp24183356p24183356.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> > 
> > 
> > 
> 
> -- 
> View this message in context: 
> http://www.nabble.com/recrawling-tp24183356p24474563.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
>
