I managed to extract from the segments the URLs that the fetcher failed
to fetch for some reason. I'm now wondering what the best
way to refetch those URLs is. My first idea was to
create another db/segments pair, inject these URLs into the new webdb,
fetch them, and then merge the results back into the main database.
Is there a better way?
Run updatedb with the segment where the fetcher failed.
Then generate a new segment.
Anything that was not fetched should be rescheduled after 7 days, I think.
If that time has not passed yet, you could use the -adddays option of the generate module.
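
The two steps above could be sketched roughly as follows. This is only a sketch, assuming a Nutch 0.x-style layout with the webdb in "db" and the segments under "segments"; the segment name and the value of 7 days are placeholders, not values taken from this thread. By default the script only prints the commands; set NUTCH=bin/nutch to actually run them.

```shell
#!/bin/sh
# Dry run by default: prints the commands instead of executing them.
# Set NUTCH=bin/nutch to run against a real installation.
NUTCH="${NUTCH:-echo bin/nutch}"

DB=db                                   # assumed webdb directory
SEGMENTS=segments                       # assumed segments directory
FAILED_SEGMENT=segments/20050712000000  # hypothetical segment with failed fetches

refetch_failed() {
    # 1. Feed the fetch results (including the failures) back into the webdb.
    $NUTCH updatedb "$DB" "$FAILED_SEGMENT"
    # 2. Generate a new segment. -adddays 7 is meant to make pages that were
    #    rescheduled within the next 7 days count as due already.
    $NUTCH generate "$DB" "$SEGMENTS" -adddays 7
}

refetch_failed
```

The newly generated segment can then be fetched and fed back with updatedb like any other segment.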


Another question is about injecting URLs into the webdb. I first inject
some seed URLs into the webdb and then start fetching them; that's OK.
But the bin/nutch generate command creates a new segment,
with URLs to fetch, from the webdb, right? What exactly does the -topN
parameter do? Does it get the N URLs from the webdb that have the greatest
rate/value/score, or does it simply get N URLs that have been
pushed onto the "top" of the webdb (is there a "top" in the webdb)?
Correct, -topN should use the N best-scored URLs from the db.
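
To illustrate the idea of "best scored" selection (this is not how Nutch implements it internally), here is a small sketch that works on a hypothetical text listing of "score URL" pairs. The sample URLs, the scores, and the top_n helper are all made up for the example.

```shell
#!/bin/sh
# Hypothetical scored URLs, one "score URL" pair per line.
sample_scores() {
    printf '%s\n' \
        '0.5 http://example.com/a' \
        '2.0 http://example.com/b' \
        '1.2 http://example.com/c' \
        '0.1 http://example.com/d'
}

# Keep the N entries with the highest score, best first.
top_n() {
    # LC_ALL=C makes sure "." is treated as the decimal point.
    LC_ALL=C sort -k1,1nr | head -n "$1"
}

# Conceptually similar to: bin/nutch generate db segments -topN 2
sample_scores | top_n 2
```

With the sample data this selects /b and /c, the two highest-scored entries, regardless of the order in which they were added.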



The last question is about refetching pages after a short while. I can construct
a list of URLs that have been added to the site (directly from the filesystem,
using the find command); what's the best way to add them to the main database?
If you crawled the whole site instead, you would also have the anchor texts in your db.
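
If injecting a list is still preferred, building the URL file from the filesystem might look something like this. It is a sketch under several assumptions: the document root /var/www/site, the base URL http://www.example.com, the "*.html" pattern, and the one-day window are all placeholders, and the inject step assumes the "-urlfile" form used in the Nutch 0.x tutorial. The inject command is only printed unless NUTCH=bin/nutch is set.

```shell
#!/bin/sh
# Dry run by default: prints the inject command instead of executing it.
NUTCH="${NUTCH:-echo bin/nutch}"
DOCROOT="${DOCROOT:-/var/www/site}"             # hypothetical document root
BASE_URL="${BASE_URL:-http://www.example.com}"  # hypothetical site URL

# Print one URL per HTML file under directory $1 that changed within the
# last day, replacing the directory prefix with the base URL $2.
paths_to_urls() {
    # Note: the prefix is used as a sed pattern, so this assumes the
    # directory name contains no "|" or other troublesome characters.
    find "$1" -type f -name '*.html' -mtime -1 | sort | sed "s|^$1|$2|"
}

if [ -d "$DOCROOT" ]; then
    paths_to_urls "$DOCROOT" "$BASE_URL" > new-urls.txt
    $NUTCH inject db -urlfile new-urls.txt
fi
```

Pages added this way carry no anchor text, since no crawled page links to them in the db yet.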


And how should I create a new segment for fetching URLs that have not
been indexed for N days?
Have a look at the times specified in the config files and use the -adddays option of the generate module.
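
As far as I understand it, -adddays effectively moves the clock forward by the given number of days when deciding which pages are due. The arithmetic can be sketched like this; the is_due helper is hypothetical, and the 30-day interval is an assumption (I believe the setting is db.default.fetch.interval, but check your own config files).

```shell
#!/bin/sh
DAY=86400  # seconds per day

# Succeed if a page with next fetch time $1 (epoch seconds) counts as due
# at time $2 when the clock is shifted forward by $3 days.
is_due() {
    [ "$1" -le $(( $2 + $3 * 86400 )) ]
}

next_fetch=$(( 30 * DAY ))  # fetched at t=0, assumed 30-day refetch interval
now=$(( 25 * DAY ))         # we are at day 25

for add_days in 0 5; do
    if is_due "$next_fetch" "$now" "$add_days"; then
        echo "-adddays $add_days: due"
    else
        echo "-adddays $add_days: not due"
    fi
done
```

So to pick up everything not fetched for N days, the value passed to -adddays would be roughly the configured interval minus N.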

Matthias
--
http://www.eventax.com - eventax GmbH
http://www.umkreisfinder.de - Die Suchmaschine für Lokales und Events


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
