Hi, I posted this to nutch-agent as well, but nutch-user seems to be more active.

I've got a few questions about customizing the crawling process. I tried checking out the Wiki, but many of the pages linked from "Becoming a Nutch Developer" are still unwritten, so any pointers you can provide would be very welcome. While some of the issues were covered on the recent "focused crawls" thread, I still have a few questions.

1) Which types of links does Nutch follow? Only HREFs? If so, I'd like it to follow some <link /> references from the page's header. I know that I can obtain the link reference with a Parse plugin, but how should I add the reference to the list of items to be crawled?

2) Which type of plugin or response from one - if any - determines what items go into the database? For instance, can I write a plugin that returns "false" if I don't want the database to store a PDF, or a Word document? Or maybe a specific page, based on something found by a Parse plugin?

Thanks in advance,

Ricardo J. Méndez
http://ricardo.strangevistas.net/

_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
