eyal edri wrote:
Hi,

I've been advised to use the 'freegen' tool in order to generate and fetch a
fresh URL list (from a txt file) while disregarding any depth-x URLs injected
into the crawldb as a result of previous fetches.

I've run a small test and noticed that when using the freegen tool, Nutch
doesn't check for URLs that have already been fetched and are in the crawldb.
Meaning that I can fetch a number of URLs from a certain segment, update them
into the db, and then use generate/freegen with the same URLs and fetch them
again; it will not check whether those URLs have already been fetched
(resulting in an unnecessary fetch).
I'm fairly sure that a subsequent updatedb will remove the duplicates, but
it's still not efficient.

Anyway, I would be glad if someone could help with that (or even contradict
me).

You are correct; that's the way this tool works. When you use generate, it does check the crawldb. freegen, as the name implies, allows you to create arbitrary fetchlists without checking the crawldb.

If you want to generate arbitrary fetchlists which contain only those URLs that are absent from the crawldb or still unfetched, then you need to write another tool to do that.
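For what it's worth, such a tool can be fairly small. Below is a rough,
untested sketch (the class name UnfetchedUrlFilter is just an example). It
assumes the Nutch 1.x crawldb layout, i.e. crawldb/current/part-* Hadoop
MapFiles of Text -> CrawlDatum, the old Hadoop MapFile.Reader API, and that
your input URLs are already normalized the same way as the crawldb keys. It
reads a plain-text URL list, drops every URL the crawldb already marks as
fetched, and prints the rest:

// Hypothetical pre-filter for freegen input: keep only URLs that are
// absent from the crawldb or still unfetched.
// Assumes Nutch 1.x crawldb layout (crawldb/current/part-* MapFiles of
// Text -> CrawlDatum) and URLs normalized like the crawldb keys.
import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class UnfetchedUrlFilter {

  public static void main(String[] args) throws Exception {
    // args[0] = crawldb directory, args[1] = text file with one URL per line
    Configuration conf = NutchConfiguration.create();
    FileSystem fs = FileSystem.get(conf);
    Path current = new Path(args[0], "current");

    // Open one MapFile.Reader per crawldb partition.
    FileStatus[] parts = fs.globStatus(new Path(current, "part-*"));
    MapFile.Reader[] readers = new MapFile.Reader[parts.length];
    for (int i = 0; i < parts.length; i++) {
      readers[i] = new MapFile.Reader(fs, parts[i].getPath().toString(), conf);
    }

    BufferedReader in = new BufferedReader(new FileReader(args[1]));
    CrawlDatum datum = new CrawlDatum();
    String url;
    while ((url = in.readLine()) != null) {
      url = url.trim();
      if (url.length() == 0) continue;
      boolean fetched = false;
      // We don't know which partition holds the URL, so check them all.
      for (MapFile.Reader reader : readers) {
        if (reader.get(new Text(url), datum) != null
            && datum.getStatus() != CrawlDatum.STATUS_DB_UNFETCHED) {
          fetched = true;
          break;
        }
      }
      // Print only URLs that are new or still unfetched.
      if (!fetched) {
        System.out.println(url);
      }
    }
    in.close();
    for (MapFile.Reader reader : readers) {
      reader.close();
    }
  }
}

You would run it with the crawldb directory and your URL file as arguments
and redirect stdout into the text file that you then pass to freegen.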


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
