Boris Suchkov wrote:
Is it possible to look up the content of a page (stored in fetcher_content) by some identifier other than recno? Obviously it can be done by actually doing a search and then clicking on "cached" for the desired page, but can it be done directly, i.e. by looking up a page in the index? I would like to avoid having to go sequentially through the fetcher_content file to find a page when I know it is somehow done in the search process.
Nutch uses the index for that. If you have indexed the fetched content, the index (which is a Lucene index) contains all the information needed to pinpoint the right recno in the fetcher_content data file. The index contains fields like <segment> and <docNo>, which point to the right record, and it also gives you the URL and title.
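To make the idea concrete, here is a toy sketch of the lookup pattern. It is not Nutch or Lucene code: IndexEntry, build_url_lookup, fetch_content, the segment names, and the sample pages are all hypothetical stand-ins. The dictionary plays the role of the index, and each list plays the role of one segment's fetcher_content file. The point is only that the index entry for a URL gives you (segment, docNo), and with that pair you can jump straight to the record instead of scanning the file.

```python
from typing import NamedTuple


class IndexEntry(NamedTuple):
    """Hypothetical stand-in for the stored fields of one indexed document."""
    url: str
    title: str
    segment: str   # which segment holds the fetched content
    doc_no: int    # record number inside that segment's content file


def build_url_lookup(entries):
    """Map each URL to its index entry so a lookup needs no linear scan."""
    return {entry.url: entry for entry in entries}


def fetch_content(url, lookup, segments):
    """Resolve a URL to (segment, doc_no), then read that record directly."""
    entry = lookup.get(url)
    if entry is None:
        return None  # URL was never indexed
    records = segments[entry.segment]
    return records[entry.doc_no]


# Made-up data: two segments, each modeled as a list of fetched page bodies.
segments = {
    "seg-a": ["<html>home</html>", "<html>about</html>"],
    "seg-b": ["<html>news</html>"],
}
entries = [
    IndexEntry("http://example.com/", "Home", "seg-a", 0),
    IndexEntry("http://example.com/about", "About", "seg-a", 1),
    IndexEntry("http://example.com/news", "News", "seg-b", 0),
]
lookup = build_url_lookup(entries)
```

In a real setup the dictionary lookup would be a query against the Lucene index on whatever field holds the URL, and the list access would be a read of the given record from the segment's data file.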
You may want to use Luke (http://www.getopt.org/luke) to dissect your index first and see what's inside.
-- Best regards, Andrzej Bialecki
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
FreeBSD developer (http://www.freebsd.org)
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
