Hello,
I would like to propose an idea. You could use a special tag, such as meta,
in the body. This tag is not displayed by the browser and does not require an
HTML comment.
<html>
<body>
HELLO
<meta name="robots" content="noindex">
<p>
HELLO NO INDEX
</p>
</meta>
</body>
</html>
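To make the idea concrete, here is a minimal sketch of how an indexer could honor such a wrapper. It uses Python's standard html.parser rather than Nutch's own parser, and the names NoIndexTextExtractor and indexable_text are hypothetical, made up for this illustration.

```python
from html.parser import HTMLParser


class NoIndexTextExtractor(HTMLParser):
    """Collect page text, skipping anything between
    <meta name="robots" content="noindex"> and the next </meta>."""

    def __init__(self):
        super().__init__()
        self.skipping = False  # True while inside a noindex region
        self.chunks = []       # text that remains indexable

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        name = (attrs.get("name") or "").lower()
        content = (attrs.get("content") or "").lower()
        if tag == "meta" and name == "robots" and "noindex" in content:
            self.skipping = True

    def handle_endtag(self, tag):
        # html.parser reports </meta> even though meta is a void element.
        if tag == "meta":
            self.skipping = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skipping:
            self.chunks.append(text)


def indexable_text(html):
    """Return the text of `html` that lies outside noindex regions."""
    parser = NoIndexTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For the page above, indexable_text would keep only the first HELLO. One caveat: meta is a void element in HTML, so stricter parsers ignore the closing </meta>, and a crawler would have to recognize this wrapper explicitly.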
"Nutch Newbie" <[EMAIL PROTECTED]il.com>
18/05/2006 08:49
To: [email protected]
cc:
Subject: Re: robot exclusion portional of a document
Please respond to: [EMAIL PROTECTED].apache.org
On 5/16/06, Alexander E Genaud <[EMAIL PROTECTED]> wrote:
> Hello,
>
> As far as I understand, /robots.txt designates which files may and may
> not be indexed by the Nutch and other crawlers. However, is there a
> method by which site may exclude only sections of a document?
>
> The benefit is most evident in the search hit result description
> (snippets) which will often contain navigation links that may not give
> useful information about a page. As far as I know, there is no
> standard. Does Nutch provide a method for document section exclusion?
> Some methods I've seen include:
It would be cool if one could remove the menu-type navigation content
from the summary snippets. How about removing all <a href=>yyy</a>
elements from the summary? The summary already strips out all HTML tags,
but wouldn't it be a good idea to give <a href> tags extra care, i.e. to
also remove the content in between? Would it be a good idea to apply the
same principle to JavaScript as well? I am very curious to know whether
there are use cases where this practice does more harm than good.
Off the top of my head: with <a
href=http://abc.co.uk>http://abc.co.uk/</a>, where the link text is itself
a URL, we would lose some information from the summary. Even so, that is
much better than having menus, JavaScript, etc. in the summary, which give
no value to the user at all.
Any ideas on how one could go about fixing the summary problem?
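As a rough illustration of the anchor-stripping idea, here is a minimal sketch that drops the text inside <a> and <script> elements before a summary is built. It uses Python's standard html.parser and is not Nutch's summarizer; SummaryTextExtractor and summary_text are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class SummaryTextExtractor(HTMLParser):
    """Collect text for a summary, ignoring link text and scripts."""

    SKIPPED_TAGS = {"a", "script"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many skipped elements are currently open
        self.chunks = []  # text kept for the summary

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.depth == 0:
            self.chunks.append(text)


def summary_text(html):
    """Return the text of `html` with link text and script bodies removed."""
    parser = SummaryTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

With this sketch a navigation menu made of links disappears from the summary, but so does a link whose text is itself a URL, which is exactly the loss of information described above.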
>
> <!-- robots content="none" -->
>
> not to be indexed
>
> <!-- /robots -->
>
>
>
> <!-- FreeFind Begin No Index -->
>
> not to be indexed
>
> <!-- FreeFind End No Index -->
>
>
> If there is no such feature and this is deemed useful, I would be
> willing to implement this feature in code.
>
> Alex
> --
> CCC7 D19D D107 F079 2F3D BF97 8443 DB5A 6DB8 9CE1
>
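For comparison, the two comment-based conventions quoted above could be handled along these lines. This is only a sketch using Python's standard html.parser, not an existing Nutch feature; the marker sets and the names CommentExclusionExtractor and text_outside_markers are assumptions made for the example.

```python
from html.parser import HTMLParser

# Comment bodies that open or close an excluded region (lowercased,
# whitespace-normalized). Taken from the two examples in the quoted mail.
BEGIN_MARKERS = {'robots content="none"', "freefind begin no index"}
END_MARKERS = {"/robots", "freefind end no index"}


class CommentExclusionExtractor(HTMLParser):
    """Collect page text, skipping regions fenced by marker comments."""

    def __init__(self):
        super().__init__()
        self.skipping = False  # True while inside an excluded region
        self.chunks = []       # text that remains indexable

    def handle_comment(self, data):
        marker = " ".join(data.split()).lower()
        if marker in BEGIN_MARKERS:
            self.skipping = True
        elif marker in END_MARKERS:
            self.skipping = False

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skipping:
            self.chunks.append(text)


def text_outside_markers(html):
    """Return the text of `html` that lies outside the marked regions."""
    parser = CommentExclusionExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

Unlike a wrapper tag, comments do not affect how browsers parse the page, so this variant leaves the markup valid.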
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general