On 5/16/06, Alexander E Genaud <[EMAIL PROTECTED]> wrote:
Hello,
As far as I understand, /robots.txt designates which files may and may
not be indexed by Nutch and other crawlers. However, is there a
method by which a site may exclude only sections of a document?
The benefit is most evident in the search hit result description
(snippets) which will often contain navigation links that may not give
useful information about a page. As far as I know, there is no
standard. Does Nutch provide a method for document section exclusion?
Some methods I've seen include:
It would be cool if one could remove the menu-type navigation content
from the summary snippets. How about removing all <a href=>yyy</a>
elements from the summary? The summary strips out all HTML tags, but
wouldn't it be a good idea to give <a href> tags extra care, i.e. to also
remove the content in between the opening and closing tags? Would it be
a good idea to apply the same principle to JavaScript as well? I am very
curious to know whether there are use cases where this practice does
more harm than good.
Off the top of my head: with <a
href=http://abc.co.uk>http://abc.co.uk/</a>, where the link text is itself
a URL, we would be losing some information from the summary. Still, that
is much better than having menus, JavaScript, etc. in the summary, which
give no value to the user at all.
Any ideas how one could go about fixing the summary problem?
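To make the link-stripping idea concrete, here is a rough sketch in Python rather than in Nutch's own Java code. It is not how Nutch builds summaries; the class and function names (SummaryTextExtractor, summary_text) are made up for illustration. It drops the text inside <a> and <script> elements, but keeps link text that is itself a URL, to cover the case mentioned above.

```python
from html.parser import HTMLParser


class SummaryTextExtractor(HTMLParser):
    """Hypothetical helper: collects summary text, skipping <a> and <script> content."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []         # text kept for the summary
        self.script_depth = 0   # > 0 while inside a <script> element
        self.in_anchor = False  # True while inside an <a> element
        self.anchor_text = []   # text seen inside the current <a>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_depth += 1
        elif tag == "a":
            self.in_anchor = True
            self.anchor_text = []

    def handle_endtag(self, tag):
        if tag == "script" and self.script_depth:
            self.script_depth -= 1
        elif tag == "a" and self.in_anchor:
            self.in_anchor = False
            text = "".join(self.anchor_text).strip()
            # Assumption: link text that is itself a URL is worth keeping.
            if text.startswith(("http://", "https://")):
                self.parts.append(text)

    def handle_data(self, data):
        if self.script_depth:
            return  # ignore script bodies entirely
        if self.in_anchor:
            self.anchor_text.append(data)
        else:
            self.parts.append(data)


def summary_text(html):
    """Return whitespace-normalized text with navigation-style link text removed."""
    parser = SummaryTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

The obvious cost is that links embedded in running prose lose their text as well, so a real implementation would probably want a smarter rule than "drop every anchor".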
<!-- robots content="none" -->
not to be indexed
<!-- /robots -->
<!-- FreeFind Begin No Index -->
not to be indexed
<!-- FreeFind End No Index -->
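As a rough illustration of how a crawler could honor the two marker conventions above, here is a minimal sketch in Python. It is not an existing Nutch feature; the names NO_INDEX_PATTERNS and strip_excluded_sections are hypothetical, and it assumes the markers are written exactly as shown, apart from whitespace and letter case.

```python
import re

# Begin/end marker pairs taken from the two conventions quoted above.
NO_INDEX_PATTERNS = [
    re.compile(
        r'<!--\s*robots\s+content="none"\s*-->.*?<!--\s*/robots\s*-->',
        re.DOTALL | re.IGNORECASE,
    ),
    re.compile(
        r'<!--\s*FreeFind Begin No Index\s*-->.*?<!--\s*FreeFind End No Index\s*-->',
        re.DOTALL | re.IGNORECASE,
    ),
]


def strip_excluded_sections(html):
    """Replace every marked region (markers included) with a single space."""
    for pattern in NO_INDEX_PATTERNS:
        html = pattern.sub(" ", html)
    return html
```

Running this on the raw page before parsing and summarizing would keep the marked regions out of both the index and the snippets. A begin marker with no matching end marker is left untouched in this sketch.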
If there is no such feature and this is deemed useful, I would be
willing to implement this feature in code.
Alex
--
CCC7 D19D D107 F079 2F3D BF97 8443 DB5A 6DB8 9CE1
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general