Stefan Groschupf wrote:
In Nutch 0.8 the index is not in the segment folder any more.
What was the reason for that? In the context of a web GUI it might be
better to have the index in the segment folder as well, since the
segment folder would then be the single item whose life cycle is managed.
The current indexer command line is optimized for one-shot batch
crawling. In this case it is best to index everything at the end, in
order to have the most up-to-date page scores from the crawl db. So it
indexes everything in a single MapReduce pass, which produces a set of
indexes that are not aligned with segments.
It would be easy to modify Indexer.index() to index just one segment at a
time, but each such run would need to process the entire crawl and link
dbs as inputs, and would thus be less efficient than indexing all
segments at once.
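To make the efficiency argument concrete, here is a rough cost sketch in Java. It is not Nutch code: the class IndexingCost, its method names, and the record counts are all made up for illustration, and it assumes the cost of a pass is simply the number of input records read.

```java
public class IndexingCost {
    /** Records read when all segments are indexed in one pass: each db is read once. */
    static long batchCost(long crawlDb, long linkDb, long[] segments) {
        long total = crawlDb + linkDb;
        for (long s : segments) {
            total += s;
        }
        return total;
    }

    /** Records read when every segment gets its own pass: both dbs are re-read each time. */
    static long perSegmentCost(long crawlDb, long linkDb, long[] segments) {
        long total = 0;
        for (long s : segments) {
            total += crawlDb + linkDb + s;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical sizes: a 10M-record crawl db, a 5M-record link db,
        // and three segments of 1M records each.
        long crawlDb = 10_000_000L;
        long linkDb = 5_000_000L;
        long[] segments = {1_000_000L, 1_000_000L, 1_000_000L};

        System.out.println("batch:       " + batchCost(crawlDb, linkDb, segments));
        System.out.println("per-segment: " + perSegmentCost(crawlDb, linkDb, segments));
    }
}
```

Under these assumed sizes the single pass reads 18,000,000 records and the per-segment runs read 48,000,000 in total, because the dbs dominate the input and are read once per segment.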
So both modes may be useful. We could add an Indexer.index() method
that takes just a single segment name and indexes it, storing the index
in the segment, and modify Indexer.main() to be able to invoke it. Then
we'd also need to modify NutchBean to find these indexes, and
IndexMerger, etc.
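A minimal sketch of how the two modes might be told apart on the command line and where each would put its output. Everything here is hypothetical: the class IndexLayout, the "-segment" flag, and the directory names "indexes" and "index" are assumptions for illustration, not the actual Indexer.main() arguments or the layout NutchBean expects.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class IndexLayout {
    /** Batch mode (assumed layout): one set of indexes for the whole crawl. */
    static Path batchIndexDir(Path crawlDir) {
        return crawlDir.resolve("indexes");
    }

    /** Per-segment mode (assumed layout): the index lives inside the segment. */
    static Path segmentIndexDir(Path segment) {
        return segment.resolve("index");
    }

    /** Picks the output directory from the arguments; "-segment" is a made-up flag. */
    static Path resolveOutput(String[] args) {
        if (args.length == 2 && args[0].equals("-segment")) {
            return segmentIndexDir(Paths.get(args[1]));
        }
        if (args.length == 1) {
            return batchIndexDir(Paths.get(args[0]));
        }
        throw new IllegalArgumentException(
            "usage: IndexLayout <crawlDir> | -segment <segmentDir>");
    }

    public static void main(String[] args) {
        System.out.println(resolveOutput(args));
    }
}
```

With the index stored under the segment, deleting a segment directory also removes its index, which is the life-cycle property asked about above. The searcher and the merger would still need to learn to look in both places.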
Doug
_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers