When I fed my own domain into the database, the segment fetch output
looked like this:
-.-.-.-.-.-.-.-.-.-.-.-.-
060109 154622 fetching http://www.niap.no/magasinet/nyheter/nord_amerika/usa/israelsk_lobby_sparker_to_ansatte
060109 154622 fetching http://www.niap.no/magasinet/nyheter/afrika
060109 154622 fetching http://www.niap.no/magasinet/nyheter/asia_australia
060109 154622 fetching http://www.niap.no/magasinet/nyheter/midtoesten/libya/eu_oensker_aa_oppheve_forbudet_mot_vaapenhandel_med_libya
060109 154622 fetching http://www.niap.no/magasinet/rss/feed/magasinet_rss1
060109 154622 fetching http://www.niap.no/magasinet/content/search
060109 154622 fetching http://www.niap.no/magasinet/nyheter/europa/tyrkia/tyrkia_vil_innfoere_fengselstraff_for_utroskap
060109 154622 fetching http://www.niap.no/magasinet/nyheter/europa/russland/stalin_vender_tilbake
060109 154622 fetching http://www.niap.no/magasinet/nyheter/nord_amerika
060109 154626 fetch okay, but can't parse http://www.niap.no/magasinet/rss/feed/magasinet_rss1, reason: failed(2,203): Content-Type not text/html: text/xml
060109 154626 fetching http://www.niap.no/magasinet/nyheter/midtoesten/irak/al_queida
060109 154633 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060109 154633 fetching http://www.niap.no/magasinet/niap/test
060109 154639 fetching http://www.niap.no/magasinet/nyheter/europa/italia/pave_benedict_xvi
060109 154642 fetch of http://www.niap.no/magasinet/nyheter/nord_amerika/usa/israelsk_lobby_sparker_to_ansatte failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetch of http://www.niap.no/magasinet/nyheter/asia_australia failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetch of http://www.niap.no/magasinet/nyheter/afrika failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetching http://www.niap.no/magasinet/nyheter/soer_amerika
060109 154642 fetch of http://www.niap.no/magasinet/nyheter/europa/tyrkia/tyrkia_vil_innfoere_fengselstraff_for_utroskap failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetch of http://www.niap.no/magasinet/nyheter/midtoesten/palestina_israel/israel_bekymret_for_landets_internasjonale_image failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetch of http://www.niap.no/magasinet/nyheter/midtoesten/libya/eu_oensker_aa_oppheve_forbudet_mot_vaapenhandel_med_libya failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetch of http://www.niap.no/magasinet/content/search failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154642 fetching http://www.niap.no/index.php/magasinet/nyheter/s_r_amerika
-.-.-.-.-.-.-
But then
-.-.-.-.-.-
060109 154714 fetch of http://phpadsnew.niap.no/adx.js failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154714 fetching http://www.niap.no/magasinet/nyheter/midtoesten/syria/russland_selger_luftforsvarssystem_til_syria
060109 154722 fetch of http://www.niap.org/ failed with: java.lang.Exception: java.net.SocketTimeoutException: connect timed out
060109 154724 fetch of http://www.niap.no/index.php/magasinet/nyheter/nord_amerika failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet/kontakt_oss failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet/magasinet/om_magasinet failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154724 fetch of http://www.niap.no/magasinet/layout/set/print failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154729 fetch of http://www.niap.no/magasinet/nyheter/midtoesten/syria/russland_selger_luftforsvarssystem_til_syria failed with: java.lang.Exception: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
060109 154730 status: segment 20060109154516, 12 pages, 31 errors, 181559 bytes, 68511 ms
060109 154730 status: 0.17515436 pages/s, 20.703678 kb/s, 15129.917 bytes/page
-.-.-.-.-.-
What is java.net.SocketTimeoutException?
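For reference, java.net.SocketTimeoutException is the standard Java exception raised when a socket operation (connect, read, or accept) does not finish within its configured timeout. The "connect timed out" entry for http://www.niap.org/ above would be consistent with that host not answering the connection attempt in time. The sketch below is not Nutch code: the class name SocketTimeoutDemo and the method readTimesOut are made up for illustration. It triggers the read-side timeout against a silent local server, because a connect timeout is hard to reproduce reliably without a real unresponsive host.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of Nutch.
public class SocketTimeoutDemo {

    /**
     * Connects to a local server that never sends data and reports whether
     * the blocking read gave up with SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMs) {
        // Port 0 asks the OS for any free port; this server never writes a byte.
        try (ServerSocket silentServer =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client = new Socket()) {

            // connect(address, timeout) itself throws SocketTimeoutException
            // when the connection cannot be established in time
            // (the "connect timed out" case). On loopback it succeeds.
            client.connect(
                    new InetSocketAddress(silentServer.getInetAddress(),
                                          silentServer.getLocalPort()),
                    timeoutMs);

            // setSoTimeout bounds how long a blocking read may wait.
            client.setSoTimeout(timeoutMs);
            InputStream in = client.getInputStream();
            try {
                in.read();      // blocks, because no data ever arrives
                return false;   // only reached if data or end-of-stream shows up
            } catch (SocketTimeoutException e) {
                return true;    // the read exceeded timeoutMs
            }
        } catch (IOException e) {
            // Any other I/O failure is unexpected in this local setup.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

In both the connect and the read case the exception only says that the peer did not respond within the limit; it does not say why (host down, firewall, slow network).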
Håvard W. Kongsgård wrote:
Is the fetcher not supposed to fetch all the docs from the URLs
provided in the urls.txt file?
The fetch process only takes some seconds, and the whole quick
tutorial is done in a minute.
Stefan Groschupf wrote:
I cannot see any problems in your log; it successfully fetched 3 pages.
Can you provide a more specific problem description?
On 09.12.2005 at 01:57, Håvard W. Kongsgård wrote:
I have followed the media-style.com quick tutorial, but when I try
to fetch my segment the fetch is killed!
I have tried setting the system clock ahead by 30 days, and no anti-virus
software is running on the systems.
Systems: SUSE 9.2 and SUSE 10
# bin/nutch fetch segments/20060109014654/
060109 014714 parsing file:/home/hkongsgaard/nutch-0.7.1/conf/nutch-default.xml
060109 014715 parsing file:/home/hkongsgaard/nutch-0.7.1/conf/nutch-site.xml
060109 014715 No FS indicated, using default:local
060109 014715 Plugins: looking in: /home/hkongsgaard/nutch-0.7.1/plugins
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/query-more
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/query-site/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.site.SiteQueryFilter
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/parse-html/plugin.xml
060109 014715 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.html.HtmlParser
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/parse-text/plugin.xml
060109 014715 impl: point=org.apache.nutch.parse.Parser class=org.apache.nutch.parse.text.TextParser
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-ext
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-pdf
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-rss
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/query-basic/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.basic.BasicQueryFilter
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/index-more
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-js
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/urlfilter-regex/plugin.xml
060109 014715 impl: point=org.apache.nutch.net.URLFilter class=org.apache.nutch.net.RegexURLFilter
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-ftp
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/parse-msword
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/creativecommons
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/ontology
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/nutch-extensionpoints/plugin.xml
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-file
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-http/plugin.xml
060109 014715 impl: point=org.apache.nutch.protocol.Protocol class=org.apache.nutch.protocol.http.Http
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/clustering-carrot2
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/language-identifier
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/urlfilter-prefix
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/query-url/plugin.xml
060109 014715 impl: point=org.apache.nutch.searcher.QueryFilter class=org.apache.nutch.searcher.url.URLQueryFilter
060109 014715 parsing: /home/hkongsgaard/nutch-0.7.1/plugins/index-basic/plugin.xml
060109 014715 impl: point=org.apache.nutch.indexer.IndexingFilter class=org.apache.nutch.indexer.basic.BasicIndexingFilter
060109 014715 not including: /home/hkongsgaard/nutch-0.7.1/plugins/protocol-httpclient
060109 014715 logging at INFO
060109 014715 fetching http://www.sourceforge.net/
060109 014715 fetching http://www.apache.org/
060109 014715 fetching http://www.nutch.org/
060109 014715 http.proxy.host = null
060109 014715 http.proxy.port = 8080
060109 014715 http.timeout = 10000
060109 014715 http.content.limit = -1
060109 014715 http.agent = NutchCVS/0.7.1 (Nutch; http://lucene.apache.org/nutch/bot.html; [email protected])
060109 014715 fetcher.server.delay = 5000
060109 014715 http.max.delays = 52
060109 014718 Using URL normalizer: org.apache.nutch.net.BasicUrlNormalizer
060109 014724 status: segment 20060109014654, 3 pages, 0 errors, 51033 bytes, 8309 ms
060109 014724 status: 0.36105427 pages/s, 47.98355 kb/s, 17011.0 bytes/page