Thanks for your reply.

OK, so I now have two pieces of homework:
- Create a Jira issue about this
- Explain how it is possible to use ExtractingRequestHandler with Solr 1.4.1 by copying jars etc.

BTW, I just figured out that Tika parses all the meta tag information, so I can rewrite the ExtractingRequestHandler classes to skip files that carry these meta directives. The following was included in my index the last time I started the ManifoldCF job:
<arr name="ignored_meta">
<str>robots</str>
<str>noindex,nofollow</str>
</arr>
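
A minimal sketch of the skip check I have in mind, written in Python only to keep it short (the real change would live in the Java handler classes). The `metadata` dictionary, mapping a meta tag name to its list of content strings, and both function names are assumptions for illustration, not Tika or Solr APIs:

```python
def robots_directives(values):
    """Collect lower-cased robots directives from a list of content strings."""
    directives = set()
    for value in values:
        # A content string such as "noindex,nofollow" holds comma-separated tokens.
        for token in value.split(","):
            token = token.strip().lower()
            if token:
                directives.add(token)
    return directives


def should_skip(metadata):
    """Return True when the robots metadata asks for the document not to be indexed."""
    directives = robots_directives(metadata.get("robots", []))
    # "none" is the usual shorthand for "noindex, nofollow".
    return "noindex" in directives or "none" in directives
```

With the values shown above, `should_skip({"robots": ["noindex,nofollow"]})` would tell the handler to drop the document.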

I have already rewritten some of these classes in order to implement language detection, so it seems that we can implement all the functionality we need by using ManifoldCF. :)

Erlend

On 27.01.11 16.37, Karl Wright wrote:
There's also ordering; the meta tag must precede all links on the page
that you don't want the crawler to follow.  Hope this is OK.
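
To illustrate the ordering point, here is a rough sketch of a single-pass link extractor, assuming the page is scanned top to bottom and links are collected as they are encountered. It uses Python's standard `html.parser` and is not ManifoldCF's actual code; `LinkCollector` and `followed_links` are made-up names:

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect hrefs until a robots meta tag containing "nofollow" is seen."""

    def __init__(self):
        super().__init__()
        self.nofollow = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            content = (attrs.get("content") or "").lower()
            tokens = {t.strip() for t in content.split(",")}
            if "nofollow" in tokens or "none" in tokens:
                self.nofollow = True
        elif tag == "a" and not self.nofollow and attrs.get("href"):
            # Links seen before the meta tag have already been recorded.
            self.links.append(attrs["href"])


def followed_links(html):
    """Return the links a single-pass scan of the page would follow."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

In this sketch a link placed before the meta tag is still collected, while links after it are dropped, which is the behaviour described above.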

Karl

On Thu, Jan 27, 2011 at 10:16 AM, Karl Wright <daddy...@gmail.com> wrote:
Sure, please open a ticket.
Interpreting the tag should not be difficult.  The main issues will be
around noting the crawler's decision to skip documents or content in
the activities history.  And, of course, this will not be available in
the ManifoldCF-0.1-incubating release.

Please specify what variants of the tag you think should be supported,
and if supported, how you think it should work.  For example,
including "nofollow" does not usually block crawlers from reaching
your linked documents from other directions; if you want that
functionality, you probably won't find that anywhere.  This is why
most people use robots.txt rather than the meta tag.

Karl


On Thu, Jan 27, 2011 at 10:04 AM, Erlend Garåsen <e.f.gara...@usit.uio.no> wrote:

I just figured out that the web crawler does not follow the rules defined by
the robots meta tag. I created a document with the following tag:
<meta name="robots" content="noindex, nofollow">

This document also has a link to another document in order to test the
"nofollow" rule, but both documents were fetched and indexed by Solr.

Should I open a Jira issue about this? I hope it's easy to rewrite the
crawler in order to add this functionality since this is a blocker for us.

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050




