You are fetching a single URL in a loop of 14 fetches (14 separate JVM
executions). It is possible that something was wrong with the DNS servers
in your LAN/WAN at the time of these fetches. At the 11th fetch, the host
name was resolved to an IP address, and Nutch was able to fetch the first
page and then the subsequent pages. This is just one example.
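
One way to test the DNS theory is to check name resolution right before each
fetch round. The following is only a rough sketch: the function names
resolve and wait_for_dns are made up for this example, it assumes a system
where getent is available, and www.example.com stands in for the host of
your seed URL.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the host name resolves through the
# system resolver. Assumes getent exists on this machine.
resolve() {
    getent hosts "$1" >/dev/null 2>&1
}

# Hypothetical helper: try to resolve host $1 up to $2 times,
# sleeping $3 seconds between failed attempts.
wait_for_dns() {
    host=$1
    attempts=${2:-5}
    delay=${3:-2}
    n=1
    while [ "$n" -le "$attempts" ]; do
        if resolve "$host"; then
            echo "resolved $host on attempt $n"
            return 0
        fi
        echo "attempt $n: $host did not resolve" >&2
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}

# Example use before a fetch round (placeholder host name):
#   wait_for_dns www.example.com 5 2 || echo "DNS still failing" >&2
```

If the check fails in the rounds that produce empty segments and passes in
the round that finally fetches, DNS is a likely suspect. If it always passes,
look elsewhere.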

Difficult to reproduce...


=========
> 060115 205601 true      19691231-18:00:00       19691231-18:00:00       0       ../segments/20060115173409

The 0 here is the size of the segment, NOT the number of pages in the fetch
list. Check org.apache.nutch.segment.SegmentReader: SegmentReader.size is
the number of entries in FetcherOutput.DIR_NAME.

The number of entries in FetcherOutput.DIR_NAME should probably be the total
of successes and errors (the different HTTP responses); I can't go into much
detail right now. A 0 means that no pages were fetched at all (not even with
HTTP errors).
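
To pick out the segments with a COUNT of 0 from the listing, something like
the sketch below might help. It assumes the real segread output has one
segment per line (the two-line rows in this mail come from line wrapping)
with the columns DATE TIME PARSED? STARTED FINISHED COUNT DIR. The function
name empty_segments is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read a segread listing on stdin and print the
# directory of every segment whose COUNT column is 0.
# Assumed columns: DATE TIME PARSED? STARTED FINISHED COUNT DIR
empty_segments() {
    # Only rows whose third field is true/false are segment rows;
    # this skips the header, the config messages and the TOTAL line.
    awk '($3 == "true" || $3 == "false") && $6 == 0 { print $7 }'
}

# Example use:
#   bin/nutch segread -list -dir segments | empty_segments
```

This only reports which segments are empty; it does not say why nothing was
written to them.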




-----Original Message-----
From: Bryan Woliner
Sent: Sunday, January 15, 2006 11:44 PM
To: 
Subject: Re: How can no URLs be fetched until the 11th round of fetching?


I don't think that I was completely clear in my first post. What you
are saying would make sense if I were doing a one-round fetch on a number
of different occasions. However, I am doing 14 rounds of fetching, all
called by one script, in the pattern outlined in the Nutch tutorial,
where my script does 14 loops of the following:
------

bin/nutch generate db segments
s=`ls -d segments/2* | tail -1`   # newest segment directory
bin/nutch fetch $s
bin/nutch updatedb db $s
------

Do you think the possibilities you suggested make sense given that I am
running these rounds of fetching within seconds of each other, all
called by the same script?

I also have a couple of related questions:

(1) In the first round of fetching, the fetchlist is generated from
the database, which was injected with the one URL that comprises my
urls file. If, in the first round of fetching, the one URL in the fetch
list can't be fetched and/or parsed, I am assuming that subsequent
rounds of fetching just use the same one-URL fetchlist until this URL
is successfully fetched and its outlinks are added to the database. Is
that correct?

(2) When I call the following command, the resulting file has no
output for the rounds where no URLs were fetched. This leads me to
believe that the fact that no URLs were fetched is not a result of a
fetching or parsing error (since such errors usually show up in the
output of this command). Does this make sense? If it does, then what
caused no URLs to be fetched?

Thanks for any helpful suggestions,
Bryan

On 1/15/06, Fuad Efendi <[EMAIL PROTECTED]> wrote:
> Many things could happen.
>
> Sample1: the website was unavailable during the first 10 fetches
> Sample2: the 11th fetch used a different IP; the DNS-to-IP mapping changed
> (or maybe was finally resolved!)
> Sample3: something changed on the site: a "redirect" was added/changed, etc.
> Sample4: the webmaster modified robots.txt
> Sample5: a big first HTML file, network errors during the first 10 fetch
> attempts, etc.
>
> It should be very uncommon behaviour, but it may happen...
>
>
> -----Original Message-----
> From: Bryan Woliner
>
> I am using Nutch 0.7.1 (no mapreduce) and did a whole-web crawl with 14
> rounds of fetching and a urls file with one URL in it. No URLs were
> fetched during the first 10 rounds, but then in the 11th round one URL was
> fetched and increasingly more URLs were fetched in rounds 12-14. I am basing
> the numbers of URLs fetched on the output from calling bin/nutch segread
> (included below). I don't understand how this can happen. If a URL is not
> fetched during a round, are its outlinks still added to the database for
> the next round of fetching? Why would I have 10 rounds of fetching with no
> URLs fetched and then suddenly have one fetched successfully in the 11th
> round?
>
> Any suggestions are appreciated.
> -Bryan
>
> Here is the output when I call:
>
> bin/nutch segread -list -dir segments
>
> run java in /usr/local/j2sdk1.4.2_08
> 060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-default.xml
> 060115 205601 parsing file:/home/bryan/nutch-0.7.1/conf/nutch-site.xml
> 060115 205601 No FS indicated, using default:local
> 060115 205601 PARSED?   STARTED                 FINISHED                COUNT   DIR NAME
> 060115 205601 true      19691231-18:00:00       19691231-18:00:00       0       ../segments/20060115173409
> 060115 205601 true      19691231-18:00:00       19691231-18:00:00       0       ../segments/20060115173413
> 060115 205601 true      19691231-18:00:00       19691231-18:00:00       0       ../segments/20060115173417
> 060115 205601 true      19691231-18:00:00       19691231-18:00:00       0       ../segments/20060115173421
> 060115 205601 true      19691231-18:00:00       19691231-18:00:00       0       ../segments/20060115173424
> 060115 205601 true      19691231-18:00:00       19691231-18:00:00       0       ../segments/20060115173428
> 060115 205601 true      19691231-18:00:00       19691231-18:00:00       0       ../segments/20060115173432
> 060115 205601 true      19691231-18:00:00       19691231-18:00:00       0       ../segments/20060115173436
> 060115 205601 true      19691231-18:00:00       19691231-18:00:00       0       ../segments/20060115173440
> 060115 205601 true      19691231-18:00:00       19691231-18:00:00       0       ../segments/20060115173443
> 060115 205602 true      20060115-17:34:51       20060115-17:34:51       1       ../segments/20060115173447
> 060115 205602 true      20060115-17:34:57       20060115-17:41:07       42      ../segments/20060115173454
> 060115 205602 true      20060115-17:41:16       20060115-18:12:28       234     ../segments/20060115174113
> 060115 205602 true      20060115-18:12:37       20060115-19:51:07       738     ../segments/20060115181234
> 060115 205602 TOTAL: 1015 entries in 14 segments.
>
>

