Richard Braman wrote:
Can someone confirm this:
You start a crawldb from a list of URLs and then generate a fetch list,
which is akin to "seeding your crawldb". When you fetch, it just fetches
those "seed" URLs. On your next round of generate/fetch/update, the fetch list
will contain the links found while parsing the pages at the original URLs.
Then on the round after that, it fetches the links discovered during the
previous fetch. So with each round of fetching, Nutch goes deeper and deeper into the
web, only fetching URLs it hasn't previously fetched.
The generate command builds the fetch list first from the seed
URLs, then (on each subsequent iteration) from the links found on those
pages, then from the links on those pages, and so on,
until the entire domain is crawled, provided you limit the crawl to those
domains with a URL filter.

Yes.
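
For the archives, here is a minimal conceptual sketch of that loop in
Python. This is not Nutch code; fetch() and parse_links() are simplified
stand-ins for the Nutch fetcher and parser, and the dict plays the role of
the crawldb. It only illustrates why each generate/fetch/update round goes
one hop deeper without refetching anything.

    import re
    import urllib.request

    def fetch(url):
        # stand-in for the Nutch fetcher: download the page body
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")

    def parse_links(html):
        # stand-in for the Nutch parser: pull out absolute href targets
        return re.findall(r'href="(https?://[^"]+)"', html)

    def crawl(seed_urls, rounds, url_filter):
        # "inject": the crawldb starts with only the seed URLs, all unfetched
        crawldb = {u: "unfetched" for u in seed_urls if url_filter(u)}
        for _ in range(rounds):
            # "generate": the fetch list is whatever is still unfetched
            fetch_list = [u for u, s in crawldb.items() if s == "unfetched"]
            for url in fetch_list:
                # "fetch" + "parse"
                try:
                    page = fetch(url)
                except OSError:
                    crawldb[url] = "failed"
                    continue
                crawldb[url] = "fetched"
                # "updatedb": newly discovered links join the crawldb,
                # so the *next* round reaches one hop deeper
                for link in parse_links(page):
                    if url_filter(link) and link not in crawldb:
                        crawldb[link] = "unfetched"
        return crawldb

    # e.g. crawl(["http://example.com/"], rounds=2,
    #            url_filter=lambda u: "example.com" in u)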

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
