Re: Using Nutch to crawl PubMed

Magnús Skúlason Tue, 21 Jul 2009 02:02:30 -0700

Hi,

You can have Nutch crawl and index pretty much anything. For protocols and
formats it does not support out of the box, you need to write a custom
protocol plugin, a parse plugin and possibly an indexing plugin.

The protocol plugin takes care of accessing the content, the parse plugin
takes care of parsing it and extracting the valuable pieces, and the indexing
plugin takes care of indexing the result.
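
To make the division of labour concrete, here is a rough sketch of the three
stages chained together. The interface and class names (FetchStep, ParseStep,
IndexStep, PluginPipelineSketch) are made up for illustration and are not
Nutch's real extension points; the "index" is just an in-memory map.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for the three plugin roles.
// These are NOT Nutch's real extension-point interfaces.
interface FetchStep { String fetch(String url); }
interface ParseStep { Map<String, String> parse(String rawContent); }
interface IndexStep { void index(String url, Map<String, String> fields); }

class PluginPipelineSketch {

    // Runs one URL through the three stages in order: access, parse, index.
    static void crawl(String url, FetchStep fetcher, ParseStep parser, IndexStep indexer) {
        String raw = fetcher.fetch(url);
        Map<String, String> fields = parser.parse(raw);
        indexer.index(url, fields);
    }

    // Toy parser: turns "Key: value" lines into lower-cased field names.
    static Map<String, String> parseKeyValue(String rawContent) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : rawContent.split("\n")) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                fields.put(line.substring(0, colon).trim().toLowerCase(),
                           line.substring(colon + 1).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a search index: URL -> extracted fields.
        Map<String, Map<String, String>> index = new LinkedHashMap<>();
        // Canned content instead of a real network fetch.
        FetchStep fetcher = url -> "Title: Example paper\nBody: full text goes here";
        crawl("example://paper/1", fetcher, PluginPipelineSketch::parseKeyValue, index::put);
        System.out.println(index);
    }
}
```

Supporting a new source then means swapping in a different fetch or parse
step while the rest of the chain stays as it is.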

Nutch comes with plugins for some standard protocols, such as:
file, ftp and http
and parser plugins such as:
html, excel, word, powerpoint, pdf, rss, flash, zip
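
As far as I know, which of these plugins are active is controlled by a regular
expression over plugin ids (the plugin.includes property in the Nutch
configuration). The sketch below only illustrates that kind of matching. The
class and method names are made up, and the ids follow the usual
protocol-xxx / parse-xxx naming, so check them against your Nutch version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class PluginIncludesSketch {

    // Returns the plugin ids that match the include expression as a whole.
    // Illustrative only; this is not Nutch's own plugin-loading code.
    static List<String> select(String includesRegex, List<String> pluginIds) {
        Pattern pattern = Pattern.compile(includesRegex);
        List<String> enabled = new ArrayList<>();
        for (String id : pluginIds) {
            if (pattern.matcher(id).matches()) {
                enabled.add(id);
            }
        }
        return enabled;
    }

    public static void main(String[] args) {
        // Assumed plugin ids, for illustration.
        List<String> available = Arrays.asList(
                "protocol-http", "protocol-ftp", "parse-html", "parse-pdf", "parse-zip");
        // Enable HTTP access plus the HTML and PDF parsers only.
        String includes = "protocol-http|parse-(html|pdf)";
        System.out.println(select(includes, available));
    }
}
```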

If you need more, you either have to look for an existing plugin, write it
yourself or find someone to do it for you.

best regards,
Magnus
http://lucenejobs.com

On Tue, Jul 21, 2009 at 5:59 AM, Arshad Khan wrote:

> Hello
>
> I am working on a project that requires downloading full text research
> papers from PubMed automatically. Can Nutch provide such a capability? If
> so, are there any examples or tutorials that can help to get Nutch
> configured for this purpose?
>
> TIA
> Arshad
>
