Re: [Nutch-general] Customizing crawling

Dennis Kubes Thu, 22 Feb 2007 07:40:34 -0800


Ricardo J. Méndez wrote:
> Hi,
> 
> I posted this to nutch-agent as well, but nutch-user seems to be more
> active.
> 
> I've got a few questions about customizing the crawling process.  I
> tried checking out the Wiki, but many of the pages linked from
> "Becoming a Nutch Developer" are still unwritten, so any pointers you
> can provide would be very welcome.


Which pages are still unwritten?
> 
> While some of the issues were covered on the recent "focused crawls"
> thread, I still have a few questions.
> 
> 1) Which types of links does Nutch follow? Only HREFs?  If so, I'd like
> it to follow some <link /> references from the page's Header.  I know
> that I can obtain the link reference with a Parse plugin, but how should
> I add the reference to the list of items to be crawled?

Nutch gets outlinks from the pages it parses.  Parsing happens either 
during the fetch, if parsing is enabled, or in a separate parse job (see 
org.apache.nutch.parse.ParseSegment).  The content is parsed by the 
plugins configured in parse-plugins.xml in the conf directory.  During 
the parse, links are created as Outlink objects and added to a ParseData 
object, which is itself added to a Parse object.  When the Parse object 
is written out (ParseOutputFormat), the outlinks are saved as CrawlDatum 
entries in the crawl_parse directory under the segment.  Then, during 
the UpdateDb job (see CrawlDb), this crawl_parse is merged into the 
master Crawl Database.  That is the long answer.

The short answer: when you parse, collect the Outlinks and add them to 
the ParseData -> Parse object.  They will then be added to the CrawlDb 
automatically when the UpdateDb job is run, and fetched when the next 
Fetch job is run.
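
As a rough illustration of the "get the links during the parse" step, 
here is a minimal standalone sketch that pulls the href values out of 
<link> tags and resolves them against the page URL.  It deliberately 
uses no Nutch classes: LinkTagExtractor and extract() are hypothetical 
names, and the regex matching is a simplification of what a real HTML 
parser would do.  In an actual parse plugin, each resulting URL would 
be wrapped in an Outlink and added to the ParseData as described above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
final class LinkTagExtractor {
    // Matches a <link ...> tag and captures its attribute text.
    private static final Pattern LINK_TAG =
        Pattern.compile("<link\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Captures the quoted value of an href attribute.
    private static final Pattern HREF =
        Pattern.compile("\\shref\\s*=\\s*[\"']([^\"']*)[\"']",
                        Pattern.CASE_INSENSITIVE);

    // Returns the absolute URLs referenced by <link href="..."> tags.
    static List<String> extract(String html, String baseUrl) {
        List<String> urls = new ArrayList<>();
        URI base = URI.create(baseUrl);
        Matcher tag = LINK_TAG.matcher(html);
        while (tag.find()) {
            Matcher href = HREF.matcher(tag.group(1));
            if (!href.find()) {
                continue; // <link> tag without a quoted href
            }
            try {
                // Resolve relative references against the page URL.
                urls.add(base.resolve(href.group(1)).toString());
            } catch (IllegalArgumentException e) {
                // Skip href values that are not valid URIs.
            }
        }
        return urls;
    }
}
```

Anchor (<a href>) links are ignored by this sketch on purpose, since 
the question was only about <link /> references.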
> 
> 2) Which type of plugin or response from one - if any - determines what
> items go into the database?  For instance, can I write a plugin that
> returns "false" if I don't want the database to store a PDF, or a Word
> document?  Or maybe a specific page, based on something found by a Parse
> plugin?

You can write URL filters and URL normalizers (scope outlink) that will 
prevent items from going into the CrawlDb.  Or, if you are writing your 
own parse plugin, simply don't add the link to the Outlinks.
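
To make the filtering idea concrete, here is a minimal standalone 
sketch of a filter that rejects PDF and Word URLs by file suffix.  It 
follows the convention of returning the URL to accept it and null to 
reject it, but it is not wired into Nutch's plugin system: the class 
name SuffixUrlFilter and its constructor are assumptions made for 
illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical standalone filter, not a Nutch plugin.
final class SuffixUrlFilter {
    private final List<String> rejectedSuffixes;

    SuffixUrlFilter(String... rejectedSuffixes) {
        this.rejectedSuffixes = Arrays.asList(rejectedSuffixes);
    }

    // Returns the URL unchanged if accepted, or null if rejected.
    String filter(String url) {
        // Ignore any query string or fragment when checking the suffix.
        int end = url.length();
        int query = url.indexOf('?');
        int fragment = url.indexOf('#');
        if (query >= 0) end = Math.min(end, query);
        if (fragment >= 0) end = Math.min(end, fragment);
        String path = url.substring(0, end).toLowerCase(Locale.ROOT);

        for (String suffix : rejectedSuffixes) {
            if (path.endsWith(suffix)) {
                return null; // rejected: the link is dropped
            }
        }
        return url; // accepted
    }
}
```

For example, new SuffixUrlFilter(".pdf", ".doc") would drop links to 
PDF and Word documents while letting ordinary pages through.  Note that 
a suffix check only sees the URL, not the content type the server 
actually returns.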

Dennis Kubes
> 
> Thanks in advance,
> 
> 
> 
> Ricardo J. Méndez
> http://ricardo.strangevistas.net/

