I managed to extract the URLs from segments which the fetcher failed to fetch for some reason. What's the best way to refetch those URLs? My first thought was to create another db/segments pair, inject the failed URLs into the new webdb, fetch them, and then merge the results back into the main database. Is there a better way?
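If the side-database approach is what you end up doing, it might look roughly like this. This is only a sketch for a 0.7-era Nutch install: the exact command syntax varies between versions, and failed.txt is a hypothetical file holding the extracted URLs.

```shell
# Create a throwaway webdb and inject the failed URLs into it
# (failed.txt is assumed to hold one URL per line).
bin/nutch admin tmpdb -create
bin/nutch inject tmpdb -urlfile failed.txt

# Generate a fetchlist segment from the temporary db and fetch it.
bin/nutch generate tmpdb tmpsegments
s=$(ls -d tmpsegments/* | tail -1)
bin/nutch fetch "$s"

# Fold the fetch results back into the main webdb.
bin/nutch updatedb db "$s"
```

That said, injecting the failed URLs straight into the main webdb and letting the next generate/fetch/updatedb round pick them up is usually simpler than maintaining a second database and merging it back.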
Another question is about injecting URLs into the webdb. I first inject some seed URLs into the webdb and then start fetching them; that's fine. But the bin/nutch generate command creates a new segment, with URLs to fetch, from the webdb, right? What does the -topN parameter do exactly? Does it take the N URLs from the webdb with the greatest rank/value/score, or does it simply take N URLs that were pushed onto the "top" of the webdb (is there even a "top" in the webdb)?

My last question is about refetching pages after a short while. I can construct a list of URLs which have been added to the site (directly from the filesystem, using the find command); what's the best way to add them to the main database? And how should I create a new segment for fetching the URLs which have not been indexed for N days?

Thanks in advance. I promise to add the answers to the Wiki as well as I can document them, because I bet there are more users wondering about the same questions :/

- Juho Mäkinen

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
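The find-based URL list mentioned above can be sketched as follows. The docroot path and base URL are assumptions (the demo builds a tiny stand-in docroot so the pipeline is self-contained); in a real deployment you would point find at the live document root instead.

```shell
# Demo stand-in for the site's document root; in practice this
# would be something like /var/www/site.
DOCROOT=$(mktemp -d)
BASEURL=http://www.example.com
mkdir -p "$DOCROOT/news"
touch "$DOCROOT/index.html" "$DOCROOT/news/today.html"

# Files changed on disk in the last 7 days, rewritten as URLs.
find "$DOCROOT" -type f -name '*.html' -mtime -7 \
  | sed "s|^$DOCROOT|$BASEURL|" > new_urls.txt

cat new_urls.txt
```

The resulting list can then be injected into the main webdb (with something like bin/nutch inject db -urlfile new_urls.txt on a 0.7-era install), and the next generate round will consider those pages. For the "not fetched for N days" case, generate's -adddays option, if your Nutch version has it, shifts the clock forward when building the fetchlist, so pages whose fetch interval would expire within that many days are included in the new segment.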
