Stefan Groschupf wrote:
In Nutch 0.8 the index is not in the segment folder any more.
What was the reason for that? In the context of a web GUI it might be better to have the index in the segment folder as well, since the segment folder would then be the single item whose life cycle has to be managed.

The current indexer command line is optimized for one-shot batch crawling. In this case it is best to index everything at the end, in order to have the most up-to-date page scores from the crawl db. So it indexes everything in a single MapReduce pass, which produces a set of indexes that are not aligned with segments.

It would be easy to modify Indexer.index() to index just a segment at a time, but each per-segment run would need to process the entire crawl and link dbs as inputs, and would thus be less efficient than indexing all segments at once.

So both modes may be useful. We could add an Indexer.index() method that takes just a single segment name and indexes it, storing the index in the segment, and modify Indexer.main() to be able to invoke it. Then we'd also need to modify NutchBean to find these indexes, and IndexMerger, etc.
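
To make the trade-off concrete, here is a rough toy model of the two modes. It is only a sketch: the class, the method names, the directory names and the "index" subdirectory inside a segment are all made up for illustration and are not the actual Indexer signatures. It just plans which inputs each job would read and where it would write.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Toy model only: none of these names or signatures are Nutch's real API. */
public class IndexerSketch {

    /** Hypothetical description of one indexing job: what it reads, where it writes. */
    public static final class Job {
        public final List<Path> inputs;
        public final Path output;

        Job(List<Path> inputs, Path output) {
            this.inputs = inputs;
            this.output = output;
        }
    }

    /** Batch mode: a single job over all segments, so each db is read once. */
    public static List<Job> planBatch(Path indexDir, Path crawlDb, Path linkDb,
                                      List<Path> segments) {
        List<Path> inputs = new ArrayList<>(segments);
        inputs.add(crawlDb);
        inputs.add(linkDb);
        List<Job> jobs = new ArrayList<>();
        jobs.add(new Job(inputs, indexDir));
        return jobs;
    }

    /** Per-segment mode: one job per segment; every job re-reads both dbs. */
    public static List<Job> planPerSegment(Path crawlDb, Path linkDb,
                                           List<Path> segments) {
        List<Job> jobs = new ArrayList<>();
        for (Path segment : segments) {
            List<Path> inputs = new ArrayList<>(Arrays.asList(segment, crawlDb, linkDb));
            // Assumption: the index is stored in an "index" directory inside the segment.
            jobs.add(new Job(inputs, segment.resolve("index")));
        }
        return jobs;
    }

    /** Counts how many jobs read the given path. */
    public static long reads(List<Job> jobs, Path path) {
        return jobs.stream().filter(job -> job.inputs.contains(path)).count();
    }

    public static void main(String[] args) {
        // Illustrative paths only.
        Path crawlDb = Paths.get("crawl", "crawldb");
        Path linkDb = Paths.get("crawl", "linkdb");
        List<Path> segments = Arrays.asList(
            Paths.get("crawl", "segments", "segA"),
            Paths.get("crawl", "segments", "segB"),
            Paths.get("crawl", "segments", "segC"));

        List<Job> batch = planBatch(Paths.get("crawl", "indexes"), crawlDb, linkDb, segments);
        List<Job> perSegment = planPerSegment(crawlDb, linkDb, segments);

        System.out.println("batch crawl db reads: " + reads(batch, crawlDb));
        System.out.println("per-segment crawl db reads: " + reads(perSegment, crawlDb));
    }
}
```

With N segments the per-segment plan reads the crawl and link dbs N times instead of once, which is the efficiency cost mentioned above; in exchange each index ends up next to the segment it was built from.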

Doug


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers