Crawler should follow the robots meta tag rules
-----------------------------------------------

                 Key: CONNECTORS-153
                 URL: https://issues.apache.org/jira/browse/CONNECTORS-153
             Project: ManifoldCF
          Issue Type: Improvement
          Components: Web connector
    Affects Versions: ManifoldCF 0.1
            Reporter: Erlend Garåsen
             Fix For: ManifoldCF next


The web crawler obeys robots.txt files, but not the rules in the robots meta 
tag. If a document includes the following meta tag, the crawler ignores the 
tag and fetches the document anyway:
<meta name="robots" content="noindex, nofollow" />

I recommend making the following changes to the crawler, to apply whenever 
one of the "Obey robots.txt ..." options is set:

1. <meta name="robots" content="noindex, nofollow" />
- do not fetch the document at all

2. <meta name="robots" content="noindex, follow" />
- do not index the document, but follow the links in it

3. <meta name="robots" content="index, nofollow" />
- fetch the document, but do not follow any links in it.

4. Change most of the text that appears on the page for the robots option 
settings to something like:
"Robots.txt usage" => "Robots.txt and Robots <meta> tag usage"
"Don't look at robots.txt" => "Ignore robots settings"
"Obey robots.txt for data caches only" => "Follow robots rules for data caches only"
"Obey robots.txt for all fetches" => "Follow robots rules for all fetches"

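To make items 1 to 3 concrete, here is a minimal sketch of how the content 
attribute of a robots meta tag could be turned into two flags. It is an 
illustration only: the class RobotsMetaDirectives and its parse method are 
hypothetical names, not part of the ManifoldCF web connector, and the sketch 
assumes that a missing or unrecognized value means "index, follow".

```java
import java.util.Locale;

/** Hypothetical helper (not ManifoldCF code): interprets a robots meta tag's content value. */
final class RobotsMetaDirectives {
    /** True if the document may be indexed. */
    final boolean index;
    /** True if the links in the document may be followed. */
    final boolean follow;

    private RobotsMetaDirectives(boolean index, boolean follow) {
        this.index = index;
        this.follow = follow;
    }

    /** Parses a value such as "noindex, nofollow"; null or unknown tokens keep the defaults. */
    static RobotsMetaDirectives parse(String content) {
        // Assumed default when nothing restricts the crawler: index and follow.
        boolean index = true;
        boolean follow = true;
        if (content != null) {
            for (String token : content.split(",")) {
                switch (token.trim().toLowerCase(Locale.ROOT)) {
                    case "noindex":
                        index = false;
                        break;
                    case "nofollow":
                        follow = false;
                        break;
                    case "none": // shorthand for "noindex, nofollow"
                        index = false;
                        follow = false;
                        break;
                    default: // "index", "follow", "all", and anything unrecognized
                        break;
                }
            }
        }
        return new RobotsMetaDirectives(index, follow);
    }
}
```

With this sketch, parse("noindex, follow") yields index == false and 
follow == true, which corresponds to item 2. Note that a crawler has to 
download a page before it can read the tag, so item 1 would in practice 
mean skipping both indexing and link extraction after the download.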


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
