I've got a few questions about customizing the crawling process.  I
tried checking out the Wiki, but many of the pages linked from
"Becoming a Nutch Developer" are still unwritten, so any pointers you
can provide would be very welcome.

1) Which types of links does Nutch follow? Only HREFs?  If so, I'd like
it to follow some <link /> references from the page's header.  I know
that I can obtain the link reference with a Parse plugin, but how should
I add the reference to the list of items to be crawled?

2) Which type of plugin or response from one - if any - determines what
items go into the database?  For instance, can I write a plugin that
returns "false" if I don't want the database to store a PDF, or a Word
document?  Or maybe a specific page, based on something found by a Parse
plugin?

Thanks in advance,

Ricardo J. Méndez
