Sorry, I misunderstood how whole-web crawling works.
One more question: how do I re-fetch the URLs that failed with
"java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded
http.max.delays: retry later."?
Is this controlled by the following property?
<property>
  <name>db.default.fetch.interval</name>
  <value>30</value>
  <description>The default number of days between re-fetches of a page.
  </description>
</property>
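Or is it something like db.fetch.retry.max? A sketch of what I am
guessing at, assuming that property name exists in nutch-default.xml
(the value is only illustrative):

<!-- Guess: how many times a page that failed with a recoverable error
     (such as RetryLater) is generated for fetch again. Property name
     assumed from nutch-default.xml; value illustrative. -->
<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered
  recoverable errors is generated for fetch.
  </description>
</property>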
Stefan Groschupf wrote:
Sorry, I still do not understand what your problem is; maybe it is
time for the weekend... :-)
From your very first mail, the log shows exactly the same thing:
060109 014715 logging at INFO
060109 014715 fetching http://www.sourceforge.net/
060109 014715 fetching http://www.apache.org/
060109 014715 fetching http://www.nutch.org/
060109 014715 http.proxy.host = null
Isn't that the same as
060109 154712 fetching http://www.niap.no/magasinet/layout/set/print
In any case, those are just logging statements; what makes you think
that something crashed?
Stefan
On 09.12.2005, at 17:44, Håvard W. Kongsgård wrote:
But when I fetch the other domains (www.sf.net, ...), the output is
only:
060109 014715 http.agent = NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; [email protected])
060109 014715 fetcher.server.delay = 5000
060109 014715 http.max.delays = 52
060109 014718 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060109 014724 status: segment 20060109014654, 3 pages, 0 errors, 51033 bytes, 8309 ms
060109 014724 status: 0.36105427 pages/s, 47.98355 kb/s, 17011.0 bytes/page
There is no output like:
060109 154712 fetching http://www.niap.no/magasinet/layout/set/print
060109 154712 fetching http://www.niap.no/magasinet/kontakt_oss
060109 154712 fetching http://www.niap.no/magasinet/ezinfo/about
060109 154712 fetching http://www.niap.no/index.php/magasinet/nyheter/midt_sten
Stefan Groschupf wrote:
What is java.net.SocketTimeoutException?
It means the crawler could not connect to the server in time.
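If it is just a slow host, raising the timeout might help, e.g.
(assuming http.timeout is the relevant property in your
nutch-site.xml; the value is only illustrative):

<!-- Assumption: http.timeout (in milliseconds) from nutch-default.xml;
     20000 is an illustrative value, not a recommendation. -->
<property>
  <name>http.timeout</name>
  <value>20000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>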
In general, you are hammering your web server, and it may block the
IP of your crawl machine.
You can configure how many threads load from one host at a time.
For an intranet crawl it is a good idea to use fewer threads (maybe
just as many as you plan to run against that host at the same time),
e.g. fetcherThreads = 2, maxThreadsPerHost = 2.
If you have more threads, you should increase the retry/delay
configuration, since a thread is delayed whenever a host is already
busy with its maximum number of threads per host. If a thread is
delayed too often, you get "Exceeded http.max.delays: retry later"
(see the sketch below).
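Put together, a conservative nutch-site.xml override might look like
this (property names as I remember them from nutch-default.xml; the
values are only illustrative):

<!-- Sketch only: names assumed from nutch-default.xml, values illustrative. -->
<property>
  <name>fetcher.threads.fetch</name>
  <value>2</value>
  <description>Total number of fetcher threads.</description>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>2</value>
  <description>Maximum number of threads fetching from one host at the
  same time.</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Seconds a thread waits before retrying a busy host.</description>
</property>
<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>How often a thread may be delayed on a busy host before
  the page fails with "Exceeded http.max.delays: retry later".</description>
</property>

With more fetcher threads, raising http.max.delays and
fetcher.server.delay should make those RetryLater errors rarer.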
Sometimes I ask myself whether queue-based fetching would not be
better than the current implementation; however, this is difficult
to change.
HTH
Stefan