I can see it has only one segment (the one created by ./bin/nutch
generate). Is there any reason why it is crawling more pages than are
given in the fetchlist?
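
For reference, a quick way to check this (a sketch, assuming the crawl
lives under ./crawl, which may differ in your setup):

  # list the segments generated so far
  ls -d crawl/segments/*

  # print crawldb statistics, including the total number of known URLs
  bin/nutch readdb crawl/crawldb -stats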

Thanks,
Somnath

On 5/1/07, qi wu <[EMAIL PROTECTED]> wrote:

How many segments were generated during your crawl?
If you have more than one segment, then some newly parsed outlinks from
the pages might have been appended to the crawldb.
To prevent this, you can try updatedb with the "-noAdditions" option in
Nutch 0.9.1.
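
A minimal sketch of that call, assuming the crawl data sits under
./crawl and the segment just fetched is the newest one in crawl/segments:

  # pick the most recently generated segment
  segment=`ls -d crawl/segments/* | tail -1`

  # update the crawldb from that segment without adding newly discovered
  # outlinks, so the crawldb stays limited to the injected URLs
  bin/nutch updatedb crawl/crawldb $segment -noAdditions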

----- Original Message -----
From: "Somnath Banerjee" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, April 30, 2007 11:12 PM
Subject: Crawling fixed set of urls (newbie question)


> Hi,
>
>    I thought I had a very simple requirement: I just want to crawl a
> fixed set of 2.3M URLs. Following the tutorial, I injected the URLs into
> the crawl db, generated a fetch list, and started fetching. After 5 days
> I found it had fetched 3M pages and fetching was still going on. I
> stopped the process, and after looking at past posts in this group I
> realized that I had lost 5 days of crawl.
>
>    Why did it fetch more pages than were in the fetch list? Is it
> because I left the value of "db.max.outlinks.per.page" at 100? Also, in
> the crawl command I didn't specify the "depth" parameter. Can somebody
> please help me understand the process? If this has already been
> discussed, please point me to the appropriate post.
>
>    From this mailing list, what I gathered is that I should generate
> small fetch lists and merge the fetched contents. Since my URL set is
> fixed, I don't want Nutch to discover new URLs. My understanding is that
> "./bin/nutch updatedb" will discover new URLs and, the next time I run
> "./bin/nutch generate", it will add those discovered URLs to the fetch
> list. Given that I just want to crawl my fixed list of URLs, what is the
> best way to do that?
>
> Thanks in advance,
> -Som
> PS: I'm using nutch-0.9 in case that is required
>
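
For a fixed URL set like the one described above, one way to proceed is
to drive the steps by hand rather than using the one-shot crawl command,
passing -noAdditions to updatedb on every round. A sketch, with
hypothetical paths and an illustrative -topN value:

  # inject the fixed list of URLs once
  bin/nutch inject crawl/crawldb urls/

  # repeat the following until generate reports nothing left to fetch
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
  segment=`ls -d crawl/segments/* | tail -1`
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment -noAdditions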
