Customizing crawling

Ricardo J. Méndez Wed, 21 Feb 2007 19:41:34 -0800

Hi,

I posted this to nutch-agent as well, but nutch-user seems to be more
active.


I've got a few questions about customizing the crawling process.  I
tried checking out the Wiki, but many of the pages linked from
"Becoming a Nutch Developer" are still unwritten, so any pointers you
can provide would be very welcome.

While some of the issues were covered on the recent "focused crawls"
thread, I still have a few questions.

1) Which types of links does Nutch follow? Only href attributes on
anchor tags?  If so, I'd like it to also follow some <link /> references
from the page's <head>.  I know that I can obtain the link reference
with a Parse plugin, but how should I add it to the list of URLs to be
crawled?

2) Which type of plugin, or which return value from one, if any,
determines what items go into the database?  For instance, can I write a
plugin that returns "false" if I don't want the database to store a PDF
or a Word document?  Or to skip a specific page, based on something
found by a Parse plugin?
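
A rough sketch of the behaviour described in both questions, written as
plain Java with no Nutch API involved. Everything here is an assumption
made for illustration: the class CrawlSketch, the methods
extractLinkHrefs and shouldStore, and the list of skipped MIME types are
all hypothetical, and the regex-based extraction is a simplification
that only handles quoted href values. How such logic would be wired into
Nutch's plugin extension points is exactly what is being asked.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; none of these names come from Nutch.
final class CrawlSketch {
    // Matches a whole <link ...> tag (but not <a ...> or other tags).
    private static final Pattern LINK_TAG =
        Pattern.compile("<link\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Pulls a single- or double-quoted href value out of one tag.
    private static final Pattern HREF =
        Pattern.compile("\\bhref\\s*=\\s*[\"']([^\"']*)[\"']", Pattern.CASE_INSENSITIVE);

    // Assumed MIME types for the "PDF or Word document" case in question 2.
    private static final Set<String> SKIPPED_TYPES =
        Set.of("application/pdf", "application/msword");

    // Question 1: collect the href of every <link> tag found in the HTML.
    static List<String> extractLinkHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            Matcher href = HREF.matcher(tag.group());
            if (href.find()) {
                hrefs.add(href.group(1));
            }
        }
        return hrefs;
    }

    // Question 2: return false for content that should not be stored.
    static boolean shouldStore(String contentType) {
        if (contentType == null) {
            return true; // unknown type: keep it by default
        }
        // Drop parameters such as "; charset=utf-8" before comparing.
        String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return !SKIPPED_TYPES.contains(base);
    }
}
```

The missing pieces are where the list returned by extractLinkHrefs
should go so that those URLs get fetched, and which plugin type is
consulted for a decision like shouldStore.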

Thanks in advance,



Ricardo J. Méndez
http://ricardo.strangevistas.net/
