Hi, I've got a few questions about customizing the crawling process. I tried checking out the Wiki, but many of the pages linked from "Becoming a Nutch Developer" are still unwritten, so any pointers you can provide would be very welcome.
1) Which types of links does Nutch follow? Only HREFs? If so, I'd like it to follow some <link /> references from the page's Header. I know that I can obtain the link reference with a Parse plugin, but how should I add the reference to the list of items to be crawled?

2) Which type of plugin or response from one - if any - determines what items go into the database? For instance, can I write a plugin that returns "false" if I don't want the database to store a PDF, or a Word document? Or maybe a specific page, based on something found by a Parse plugin?

Thanks in advance,

Ricardo J. Méndez
http://ricardo.strangevistas.net/