[Nutch-general] Customizing crawling

Ricardo J. Méndez Wed, 21 Feb 2007 19:42:06 -0800

Hi,

I posted this to nutch-agent as well, but nutch-user seems to be more
active.


I've got a few questions about customizing the crawling process.  I
tried checking out the Wiki, but many of the pages linked from
"Becoming a Nutch Developer" are still unwritten, so any pointers you
can provide would be very welcome.

While some of the issues were covered on the recent "focused crawls"
thread, I still have a few questions.

1) Which types of links does Nutch follow? Only HREFs?  If so, I'd also
like it to follow some <link /> references from the page's <head>.  I
know that I can obtain the link reference with a Parse plugin, but how
should I add that reference to the list of items to be crawled?

2) Which type of plugin (or which response from one), if any, determines
which items go into the database?  For instance, can I write a plugin
that returns "false" if I don't want the database to store a PDF or a
Word document?  Or to skip a specific page, based on something found by
a Parse plugin?
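
To make question 1 concrete, the sketch below shows roughly the kind of
<link /> extraction meant there.  It is only an illustration, written in
plain Python rather than as a Nutch Parse plugin: the names
LinkRefCollector and collect_link_refs and the example.com URLs are made
up for the example, and it does not touch any Nutch API.  How to hand
the resulting URLs to the crawler is exactly the open question.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkRefCollector(HTMLParser):
    """Hypothetical helper: collects href values of <link> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.refs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, and routes
        # self-closing tags such as <link ... /> through this method too.
        if tag != "link":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative references against the page URL.
            self.refs.append(urljoin(self.base_url, href))


def collect_link_refs(html, base_url):
    """Return absolute URLs of all <link href=...> references in html."""
    parser = LinkRefCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.refs


if __name__ == "__main__":
    # Made-up sample page; ordinary <a href> links are ignored on purpose.
    page = (
        '<html><head>'
        '<link rel="alternate" href="/feed.xml" />'
        '<link rel="stylesheet" href="http://example.com/s.css">'
        '</head><body><a href="/x">x</a></body></html>'
    )
    for ref in collect_link_refs(page, "http://example.com/page.html"):
        print(ref)
```

The collected URLs are what I would like to end up in the list of items
to be crawled, next to the ordinary HREFs.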

Thanks in advance,



Ricardo J. Méndez
http://ricardo.strangevistas.net/
