Hi,

You can have Nutch crawl and index pretty much everything. For specific protocols and formats you only need to write custom protocol, parse and maybe even indexing plugins.
The protocol plugin takes care of accessing the content, the parse plugin takes care of parsing the content and extracting the valuable pieces, and the indexer plugin takes care of indexing it.

Nutch comes with protocol plugins for some standard protocols (file, ftp and http) and with parser plugins for formats such as html, excel, word, powerpoint, pdf, rss, flash and zip. If you need more, you either have to look for a specific extension, write it yourself or find someone to do it for you.

best regards,
Magnus
http://lucenejobs.com

On Tue, Jul 21, 2009 at 5:59 AM, Arshad Khan <[email protected]> wrote:
> Hello
>
> I am working on a project that requires downloading full text research
> papers from PubMed automatically. Can Nutch provide such capability? If yes
> then are there any examples or tutorials that can help to get Nutch
> configured for this purpose?
>
> TIA
> Arshad
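
To make the division of labour between the three plugin types a bit more concrete, here is a rough Java sketch of the fetch, parse, index flow. Note that the interfaces ProtocolPlugin, ParsePlugin and IndexPlugin, the class PipelineSketch and the example URL are all made up for illustration. They are not Nutch's real extension-point API, and the "index" is just an in-memory map.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PipelineSketch {

    // Hypothetical stand-ins for the three plugin roles (NOT the real Nutch interfaces).
    interface ProtocolPlugin { byte[] fetch(String url); }            // accesses the content
    interface ParsePlugin    { String parse(byte[] raw); }            // extracts the useful text
    interface IndexPlugin    { void index(String url, String text); } // makes it searchable

    // Runs one URL through the three stages in order.
    static void process(String url, ProtocolPlugin protocol,
                        ParsePlugin parser, IndexPlugin indexer) {
        byte[] raw = protocol.fetch(url);
        String text = parser.parse(raw);
        indexer.index(url, text);
    }

    // Wires up toy implementations and returns the resulting term -> URLs map.
    static Map<String, List<String>> demo() {
        final Map<String, List<String>> index = new TreeMap<>();

        // Toy protocol plugin: returns canned HTML instead of doing network I/O.
        ProtocolPlugin protocol = url ->
            "<html><body>Full text paper</body></html>".getBytes(StandardCharsets.UTF_8);

        // Toy parse plugin: strips tags with a regex (a real parser would do far more).
        ParsePlugin parser = raw ->
            new String(raw, StandardCharsets.UTF_8).replaceAll("<[^>]+>", " ").trim();

        // Toy index plugin: lower-cases each term and records which URL contained it.
        IndexPlugin indexer = (url, text) -> {
            for (String term : text.toLowerCase().split("\\s+")) {
                index.computeIfAbsent(term, k -> new ArrayList<>()).add(url);
            }
        };

        process("http://example.org/paper1", protocol, parser, indexer);
        return index;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In a real setup each of these roles would be a separate Nutch plugin; for a new source you would typically only replace the stage that the bundled plugins do not already cover.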
