Hi Ragy,
well, this may be difficult. The problem is that Nutch 0.7 and 0.8 are
centered around URLs. You have the crawl DB (the WebDB in 0.7), and the
keys of its entries are URLs. So getting the complete workflow to handle
multiple documents per URL would be difficult.
However, what you can do in Nutch 0.8 is write your own MapReduce job,
where the mapper extracts the content and the reducer generates an
index from it.
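A rough, untested sketch of what such a job could look like, written
against the classic org.apache.hadoop.mapred API (slightly newer than
what Nutch 0.8 ships with, but the shape of the job is the same); the
caption extraction here is a naive placeholder:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CaptionJob {

  // Mapper: one fetched page in, one record per extracted caption out.
  public static class ExtractMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text url, Text html,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      for (String caption : extractCaptions(html.toString())) {
        output.collect(url, new Text(caption));
      }
    }
  }

  // Reducer: one output record per caption, so each caption can become
  // its own entry; a real job would write Lucene documents through an
  // indexing OutputFormat instead of plain text.
  public static class IndexReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text url, Iterator<Text> captions,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      int i = 0;
      while (captions.hasNext()) {
        output.collect(new Text(url.toString() + "#" + i++), captions.next());
      }
    }
  }

  // Placeholder extraction: grabs img alt attributes with a naive regex.
  static List<String> extractCaptions(String html) {
    List<String> captions = new ArrayList<String>();
    Matcher m = Pattern.compile("alt=\"([^\"]*)\"").matcher(html);
    while (m.find()) {
      captions.add(m.group(1));
    }
    return captions;
  }
}

A JobConf driver would then point the mapper at the fetched segment
content; the point is just that the (key, value) flow is no longer one
document per URL.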
In that case you would also need to write a custom UI, since the data
structures will be different, e.g. you would not have the segment data.
Anyway, it is possible, but it is some work.
HTH
Stefan
On 23.02.2006, at 19:13, Ragy Eleish wrote:
Hi,
I need to get multiple search result entries from a single URL. For
example, I want to index the photo captions at this URL,
http://racer007.albumpost.com/montreal, without having to navigate to
each picture page, because sometimes there is no individual picture page.
I did it by writing an HTMLParserFilter, modifying ParseData and the
Fetcher, and then disabling the duplicate-cleaning code in the
CrawlerTool. I did this in Nutch 0.7.1. Is there a better way of doing
this?
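For reference, a minimal, untested sketch of what a parse filter at that
extension point can look like (written against the Nutch 0.7-era
HtmlParseFilter interface; the img alt extraction is a hypothetical
stand-in for real caption markup, and this version only appends the
captions to the page text rather than emitting separate entries):

import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class CaptionParseFilter implements HtmlParseFilter {

  public Parse filter(Content content, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    StringBuffer captions = new StringBuffer();
    collectCaptions(doc, captions);
    // Return a new Parse whose text also carries the captions, so they
    // become searchable along with the rest of the page.
    return new ParseImpl(parse.getText() + " " + captions, parse.getData());
  }

  // Collects img alt attributes as a stand-in for whatever site-specific
  // caption markup actually needs to be matched.
  private void collectCaptions(Node node, StringBuffer out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "img".equalsIgnoreCase(node.getNodeName())) {
      Node alt = node.getAttributes().getNamedItem("alt");
      if (alt != null) {
        out.append(alt.getNodeValue()).append(' ');
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectCaptions(children.item(i), out);
    }
  }
}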
Regards
--Ragy
---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com