I managed to extract the URLs from segments which the fetcher failed to fetch for some reason. What's the best way to refetch those URLs? My first thought was to create another db/segments pair, inject the failed URLs into the new webdb, fetch them, and then merge the results back into the main database. Is there a better way?
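If the side-database approach is what you end up doing, it might look roughly like this. This is only a sketch for a 0.7-era Nutch install: the exact command syntax varies between versions, and failed.txt is a hypothetical file holding the extracted URLs.

```shell
# Create a throwaway webdb and inject the failed URLs into it
# (failed.txt is assumed to hold one URL per line).
bin/nutch admin tmpdb -create
bin/nutch inject tmpdb -urlfile failed.txt

# Generate a fetchlist segment from the temporary db and fetch it.
bin/nutch generate tmpdb tmpsegments
s=$(ls -d tmpsegments/* | tail -1)
bin/nutch fetch "$s"

# Fold the fetch results back into the main webdb.
bin/nutch updatedb db "$s"
```

That said, injecting the failed URLs straight into the main webdb and letting the next generate/fetch/updatedb round pick them up is usually simpler than maintaining a second database and merging it back.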
Another question is about injecting URLs into the webdb. I first inject some seed URLs into the webdb and then start fetching them; that's fine. But the bin/nutch generate command creates a new segment, with URLs to fetch, from the webdb, right? What does the -topN parameter do exactly? Does it take the N URLs from the webdb with the greatest rank/value/score, or does it simply take N URLs that were pushed onto the "top" of the webdb (is there even a "top" in the webdb)?

My last question is about refetching pages after a short while. I can construct a list of URLs which have been added to the site (directly from the filesystem, using the find command); what's the best way to add them to the main database? And how should I create a new segment for fetching the URLs which have not been indexed for N days?

Thanks in advance. I promise to add the answers to the Wiki as well as I can document them, because I bet there are more users wondering about the same questions :/

- Juho Mäkinen

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
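The find-based URL list mentioned above can be sketched as follows. The docroot path and base URL are assumptions (the demo builds a tiny stand-in docroot so the pipeline is self-contained); in a real deployment you would point find at the live document root instead.

```shell
# Demo stand-in for the site's document root; in practice this
# would be something like /var/www/site.
DOCROOT=$(mktemp -d)
BASEURL=http://www.example.com
mkdir -p "$DOCROOT/news"
touch "$DOCROOT/index.html" "$DOCROOT/news/today.html"

# Files changed on disk in the last 7 days, rewritten as URLs.
find "$DOCROOT" -type f -name '*.html' -mtime -7 \
  | sed "s|^$DOCROOT|$BASEURL|" > new_urls.txt

cat new_urls.txt
```

The resulting list can then be injected into the main webdb (with something like bin/nutch inject db -urlfile new_urls.txt on a 0.7-era install), and the next generate round will consider those pages. For the "not fetched for N days" case, generate's -adddays option, if your Nutch version has it, shifts the clock forward when building the fetchlist, so pages whose fetch interval would expire within that many days are included in the new segment.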
