Re: recrawling

Neeti Gupta Mon, 13 Jul 2009 23:50:54 -0700

But are there any rules we can use to decide when to recrawl a website, so
that we get its updated content as soon as possible?




Otis Gospodnetic-2 wrote:
> 
> 
> Neeti,
> 
> I don't think there is a way to know when a regular web site has been
> updated.  You can issue GET or HEAD requests and look at the Last-Modified
> date, but this is not 100% reliable.  You can fetch and compare content,
> but that's not 100% reliable either.  If you are indexing blogs, then you
> can get "pings" when they update, or can rely on detecting changes in
> their feeds.
> 
>  Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> 
> ----- Original Message ----
>> From: Neeti Gupta <[email protected]>
>> To: [email protected]
>> Sent: Wednesday, June 24, 2009 7:52:47 AM
>> Subject: recrawling
>> 
>> 
>> We have built a crawler that visits various sites, and I want it to
>> crawl each site as soon as it is updated. Can anyone help me find out
>> how to tell when a site has been updated and it is time to crawl it
>> again?
>> -- 
>> View this message in context: 
>> http://www.nabble.com/recrawling-tp24183356p24183356.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
> 
> 
> 
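
For reference, the two checks Otis mentions (looking at Last-Modified via a
HEAD request, and fetching and comparing content) could be sketched roughly
as below. This is only an illustration in plain Python, not Nutch code: the
function names, the `previous` record layout and the "recrawl when unsure"
fallback are all assumptions made for the example. As noted in the reply,
neither signal is 100% reliable, so treat the result as a hint.

```python
import hashlib
import urllib.request


def fetch_last_modified(url, timeout=10):
    """Issue a HEAD request and return the Last-Modified header, or None."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Last-Modified")


def content_fingerprint(body):
    """Return a SHA-256 hex digest of the raw page bytes."""
    return hashlib.sha256(body).hexdigest()


def has_changed(previous, last_modified=None, body=None):
    """Guess whether a page changed since the last crawl.

    `previous` is a hypothetical record saved at the last crawl, with
    optional keys 'last_modified' and 'fingerprint'.
    """
    old_stamp = previous.get("last_modified")
    old_print = previous.get("fingerprint")

    # A different Last-Modified value is taken as a change.
    if last_modified and old_stamp and last_modified != old_stamp:
        return True

    # If the body was fetched, compare it with the stored fingerprint.
    if body is not None and old_print:
        return content_fingerprint(body) != old_print

    # Same Last-Modified and nothing else to compare: assume unchanged.
    if last_modified and old_stamp:
        return False

    # No usable evidence (assumption): recrawl to be safe.
    return True
```

Note that a byte-for-byte fingerprint will also report a change when only
dynamic parts of a page differ (dates, ads, session tokens), which is one
reason content comparison is not fully reliable either.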

-- 
View this message in context: 
http://www.nabble.com/recrawling-tp24183356p24474563.html
Sent from the Nutch - User mailing list archive at Nabble.com.
