Hello,

As far as I understand, /robots.txt designates which files may and may
not be crawled by Nutch and other crawlers. However, is there a
method by which a site may exclude only sections of a document?

The benefit is most evident in the search result descriptions
(snippets), which often contain navigation links that give no
useful information about a page. As far as I know, there is no
standard for this. Does Nutch provide a method for excluding sections
of a document? Some conventions I've seen include:

<!-- robots content="none" -->
not to be indexed
<!-- /robots -->

<!-- FreeFind Begin No Index -->
not to be indexed
<!-- FreeFind End No Index -->
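For what it's worth, the stripping step itself seems simple. Here is a
minimal standalone sketch in Python (not Nutch plugin code; the marker
patterns and function name are my own illustration of the two
conventions quoted above):

```python
import re

# Hypothetical begin/end marker pairs; neither syntax is a formal
# standard -- these are just the two conventions quoted above.
MARKER_PAIRS = [
    (r'<!--\s*robots\s+content="none"\s*-->', r'<!--\s*/robots\s*-->'),
    (r'<!--\s*FreeFind Begin No Index\s*-->', r'<!--\s*FreeFind End No Index\s*-->'),
]

def strip_noindex(html: str) -> str:
    """Remove every span between a begin/end marker pair, so the
    remaining text is what gets indexed and snippeted."""
    for begin, end in MARKER_PAIRS:
        # DOTALL lets the excluded span cross line breaks.
        html = re.sub(begin + r".*?" + end, "", html, flags=re.DOTALL)
    return html
```

In Nutch this logic would presumably run as a parse-time filter, before
the parsed text reaches the indexer and summarizer.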


If there is no such feature and this is deemed useful, I would be
willing to implement this feature in code.

Alex
--
CCC7 D19D D107 F079 2F3D BF97 8443 DB5A 6DB8 9CE1


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
