Hi Ragy,
well, this may be difficult. The problem is that Nutch 0.7 and 0.8 are
centered around URLs. You have the crawl DB (the WebDB in 0.7), and the
keys of its entries are URLs. So getting the complete workflow to handle
multiple documents per URL would be difficult.
However, what you can do in Nutch 0.8 is write your own MapReduce job,
where the mapper extracts the content and the reducer generates an
index from it.
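A rough, untested sketch of what such a job could look like, written
against the classic org.apache.hadoop.mapred API (slightly newer than
what Nutch 0.8 ships with, but the shape of the job is the same); the
caption extraction here is a naive placeholder:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CaptionJob {

  // Mapper: one fetched page in, one record per extracted caption out.
  public static class ExtractMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text url, Text html,
                    OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      for (String caption : extractCaptions(html.toString())) {
        output.collect(url, new Text(caption));
      }
    }
  }

  // Reducer: one output record per caption, so each caption can become
  // its own entry; a real job would write Lucene documents through an
  // indexing OutputFormat instead of plain text.
  public static class IndexReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text url, Iterator<Text> captions,
                       OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {
      int i = 0;
      while (captions.hasNext()) {
        output.collect(new Text(url.toString() + "#" + i++), captions.next());
      }
    }
  }

  // Placeholder extraction: grabs img alt attributes with a naive regex.
  static List<String> extractCaptions(String html) {
    List<String> captions = new ArrayList<String>();
    Matcher m = Pattern.compile("alt=\"([^\"]*)\"").matcher(html);
    while (m.find()) {
      captions.add(m.group(1));
    }
    return captions;
  }
}

A JobConf driver would then point the mapper at the fetched segment
content; the point is just that the (key, value) flow is no longer one
document per URL.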
In that case you would also need to write a custom UI, since the data
structures will be different, e.g. you would not have the segment data.
Anyway, it is possible, but it is some work.
HTH
Stefan
On 23.02.2006, at 19:13, Ragy Eleish wrote:
Hi,
I need to get multiple search result entries from a single URL. For
example, I want to index the photo captions at this URL,
http://racer007.albumpost.com/montreal, without having to navigate to
each picture page, because sometimes there is no individual picture page.
I did it by writing an HTMLParserFilter, modifying ParseData and the
Fetcher, and then disabling the duplicate-cleaning code in the
CrawlerTool. I did this in Nutch 0.7.1. Is there a better way of doing
this?
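For reference, a minimal, untested sketch of what a parse filter at that
extension point can look like (written against the Nutch 0.7-era
HtmlParseFilter interface; the img alt extraction is a hypothetical
stand-in for real caption markup, and this version only appends the
captions to the page text rather than emitting separate entries):

import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseImpl;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class CaptionParseFilter implements HtmlParseFilter {

  public Parse filter(Content content, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    StringBuffer captions = new StringBuffer();
    collectCaptions(doc, captions);
    // Return a new Parse whose text also carries the captions, so they
    // become searchable along with the rest of the page.
    return new ParseImpl(parse.getText() + " " + captions, parse.getData());
  }

  // Collects img alt attributes as a stand-in for whatever site-specific
  // caption markup actually needs to be matched.
  private void collectCaptions(Node node, StringBuffer out) {
    if (node.getNodeType() == Node.ELEMENT_NODE
        && "img".equalsIgnoreCase(node.getNodeName())) {
      Node alt = node.getAttributes().getNamedItem("alt");
      if (alt != null) {
        out.append(alt.getNodeValue()).append(' ');
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      collectCaptions(children.item(i), out);
    }
  }
}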
Regards
--Ragy
---------------------------------------------
blog: http://www.find23.org
company: http://www.media-style.com