[Nutch-dev] Re: Indexing the whole WebDB or get Pages out of WebDB that are Indexed

Doug Cutting Mon, 15 Aug 2005 09:59:21 -0700

Nils Hoeller wrote:

Is there a way to index the whole WebDB,
which means the normal sites that have been indexed
+ the sites that are one depth deeper and so are only stored in the WebDB?

This is supposed to be possible, but I think no one has tried it in a while, and I fear it may no longer work.

If you specify '-refetchonly' when you generate your fetchlist then it should generate a fetchlist with fetch=false entries for all of the previously unfetched pages. Then the fetcher should pass these through to the output with null content, and the indexer should index the URL and incoming anchor texts.

But glancing at the current code it looks like IndexSegment.java does not index entries with ProtocolStatus.NOTFETCHING.

If you desire this behavior, please file a bug report. Also, please try to patch IndexSegment.java so that it does index these entries. If this works, please attach your patch to the bug report.
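
To make the intended behavior concrete, here is a minimal, self-contained sketch of the decision such a patch would need to make. It is not taken from the Nutch source: the class IndexFilterSketch, the Status enum, the Entry type, the indexUnfetched switch, and the field names are all hypothetical stand-ins, and the real IndexSegment.java may be structured quite differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-ins only; none of these are the real Nutch classes.
final class IndexFilterSketch {

    // Simplified stand-in for a protocol status code.
    enum Status { SUCCESS, NOTFETCHING, FAILED }

    // Simplified stand-in for one entry in a fetched segment.
    static final class Entry {
        final String url;
        final Status status;
        final String content;        // null when the page was not fetched
        final List<String> anchors;  // incoming anchor texts

        Entry(String url, Status status, String content, List<String> anchors) {
            this.url = url;
            this.status = status;
            this.content = content;
            this.anchors = anchors;
        }
    }

    // Successfully fetched entries are always indexed. Entries that were
    // deliberately not fetched are indexed only when indexUnfetched is set
    // (an assumed switch, not an existing option).
    static boolean shouldIndex(Entry e, boolean indexUnfetched) {
        if (e.status == Status.SUCCESS) {
            return true;
        }
        return indexUnfetched && e.status == Status.NOTFETCHING;
    }

    // Names of the fields that would be indexed for an entry: the URL always,
    // anchor texts when present, and content only when there is some.
    static List<String> fieldsFor(Entry e) {
        List<String> fields = new ArrayList<>();
        fields.add("url");
        if (!e.anchors.isEmpty()) {
            fields.add("anchor");
        }
        if (e.content != null) {
            fields.add("content");
        }
        return fields;
    }

    public static void main(String[] args) {
        Entry unfetched = new Entry("http://example.com/deeper",
                Status.NOTFETCHING, null,
                Collections.singletonList("a deeper page"));
        System.out.println(shouldIndex(unfetched, false));
        System.out.println(shouldIndex(unfetched, true));
        System.out.println(fieldsFor(unfetched));
    }
}
```

In this sketch an unfetched entry is skipped by default, which matches the behavior described above, and is indexed by URL and anchor text alone once the switch is turned on.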


Doug


