[Nutch-general] RE: Crawling a page for links, but not indexing it

Vanderdray, Jacob Thu, 17 Nov 2005 09:47:03 -0800

Dean,

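I'm not sure whether the Nutch crawler actually supports it, but you
should be able to use a robots noindex meta tag in the archive pages.
With noindex alone, a compliant crawler skips indexing the page but
still follows its links, since follow is the default.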

See http://www.robotstxt.org/wc/meta-user.html for more information.
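
For illustration, here is a minimal sketch of how a crawler could read
that tag and decide what to do with the page. It is not Nutch's code,
and I haven't checked how Nutch handles the tag; the names
RobotsMetaParser, robots_policy and ARCHIVE_PAGE, and the sample page
itself, are made up for the example. It assumes the usual convention
that index and follow are the defaults and that "none" means
noindex,nofollow.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_policy(html):
    """Return (should_index, should_follow); both default to True."""
    parser = RobotsMetaParser()
    parser.feed(html)
    found = parser.directives
    should_index = not ({"noindex", "none"} & found)
    should_follow = not ({"nofollow", "none"} & found)
    return should_index, should_follow


# Hypothetical archive page: not to be indexed, but its links are followed.
ARCHIVE_PAGE = """<html><head>
<meta name="robots" content="noindex,follow">
<title>Archive</title>
</head><body><a href="/articles/1.html">Original article</a></body></html>"""

print(robots_policy(ARCHIVE_PAGE))  # (False, True)
```

In the archive pages themselves, the only change needed is the one
meta line inside the head element; the explicit "follow" is optional
but makes the intent obvious.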

Jake.

-----Original Message-----
From: Dean Elwood [mailto:[EMAIL PROTECTED] 
Sent: Thursday, November 17, 2005 12:34 PM
To: [email protected]
Subject: Crawling a page for links, but not indexing it

I'm indexing a lot of pages which are archives - they contain both a
link to the original article, and part of the text of the original
article.

So ideally I want to crawl the "parent" archive page and index
everything it links to, but I don't actually want to index the
"parent" page itself.

I hope that makes sense...

Is this possible? I'm using the intranet crawling method.

Many thanks,

Dean 

