[
https://issues.apache.org/jira/browse/CONNECTORS-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Schuch updated CONNECTORS-1392:
--------------------------------------
Description:
The Web connector already has an option to ignore robots.txt.
This ticket adds another option that lets the connector ignore
robots instructions in {{<meta name="robots ...}} tags and {{<a ...
rel="nofollow" ...}} attributes.
*Proposal (to be discussed)*
Add a new option list "Page level robots instructions" to the "Robots" Tab.
List entries:
# Obey meta robots tags (the default)
# Don't look at meta robots tags
The end-user documentation needs to be updated.
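To make the proposal concrete, here is a minimal Java sketch of how the two list entries could affect crawl decisions. It is not ManifoldCF code: the class, enum, and method names are hypothetical, and it assumes the new option governs both {{<meta name="robots">}} directives and {{rel="nofollow"}} link attributes, as the description above suggests. The handling of {{none}} as shorthand for {{noindex, nofollow}} follows the Google documentation linked below.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; none of these names exist in ManifoldCF.
class PageLevelRobotsSketch {

    /** Hypothetical values mirroring the two proposed list entries. */
    enum PageLevelRobotsUsage { OBEY_META_ROBOTS, IGNORE_META_ROBOTS }

    /** Splits a meta robots content value such as "noindex, nofollow" into lower-case directives. */
    static Set<String> parseDirectives(String metaRobotsContent) {
        Set<String> directives = new HashSet<>();
        if (metaRobotsContent == null) {
            return directives;
        }
        for (String token : metaRobotsContent.split(",")) {
            String directive = token.trim().toLowerCase(Locale.ROOT);
            if (!directive.isEmpty()) {
                directives.add(directive);
            }
        }
        return directives;
    }

    /** Returns true if the page may be indexed under the chosen option. */
    static boolean mayIndex(PageLevelRobotsUsage usage, String metaRobotsContent) {
        if (usage == PageLevelRobotsUsage.IGNORE_META_ROBOTS) {
            return true;
        }
        Set<String> directives = parseDirectives(metaRobotsContent);
        return !(directives.contains("noindex") || directives.contains("none"));
    }

    /** Returns true if a link on the page may be followed under the chosen option. */
    static boolean mayFollowLink(PageLevelRobotsUsage usage, String metaRobotsContent, String linkRel) {
        if (usage == PageLevelRobotsUsage.IGNORE_META_ROBOTS) {
            return true;
        }
        Set<String> directives = parseDirectives(metaRobotsContent);
        if (directives.contains("nofollow") || directives.contains("none")) {
            return false;
        }
        if (linkRel != null) {
            // The rel attribute is a space-separated list of tokens.
            for (String token : linkRel.trim().split("\\s+")) {
                if (token.equalsIgnoreCase("nofollow")) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PageLevelRobotsUsage obey = PageLevelRobotsUsage.OBEY_META_ROBOTS;
        PageLevelRobotsUsage ignore = PageLevelRobotsUsage.IGNORE_META_ROBOTS;
        System.out.println(mayIndex(obey, "noindex, follow"));
        System.out.println(mayIndex(ignore, "noindex, follow"));
        System.out.println(mayFollowLink(obey, "index, follow", "external nofollow"));
        System.out.println(mayFollowLink(ignore, "nofollow", "nofollow"));
    }
}
```

With "Obey meta robots tags" the sketch skips indexing for {{noindex}} pages and skips links marked {{nofollow}}; with "Don't look at meta robots tags" both checks are bypassed. Whether the real option should also cover {{rel}} attributes is part of what is still to be discussed.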
Google resources on robots instructions in HTML pages:
[0]
https://support.google.com/webmasters/answer/79812?hl=en&ctx=cb&src=cb&cbid=tnnsjq5jcodt&cbrank=4
[1]
https://support.google.com/webmasters/answer/96569?hl=en&ctx=cb&src=cb&cbid=-5rmggrfsp2rq&cbrank=3
[2]
https://developers.google.com/webmasters/control-crawl-index/docs/robots_meta_tag?csw=1
Thread on the mailing list
[3] https://www.mail-archive.com/[email protected]/msg03258.html
was:
The Web connector already has an option to ignore robots.txt.
This ticket adds another option that lets the connector ignore
robots instructions in {{<meta name="robots ...}} tags and {{<a ...
rel="nofollow" ...}} attributes.
*First proposal (to be discussed)*
Reuse the existing "Robots.txt usage" option in the "Robots" Tab. Rename the
existing options:
# Don't look at robots.txt, meta robots and rel attributes
# Obey robots.txt, meta robots tags and rel attributes for data fetches only
# Obey robots.txt, meta robots tags and rel attributes _(the default)_
The end-user documentation needs to be updated.
Google resources on robots instructions in HTML pages:
[0]
https://support.google.com/webmasters/answer/79812?hl=en&ctx=cb&src=cb&cbid=tnnsjq5jcodt&cbrank=4
[1]
https://support.google.com/webmasters/answer/96569?hl=en&ctx=cb&src=cb&cbid=-5rmggrfsp2rq&cbrank=3
Thread on the mailing list
[2] https://www.mail-archive.com/[email protected]/msg03258.html
> Add option for Web connector to ignore robots instructions in meta tags and
> rel attributes
> ------------------------------------------------------------------------------------------
>
> Key: CONNECTORS-1392
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1392
> Project: ManifoldCF
> Issue Type: New Feature
> Components: Web connector
> Reporter: Markus Schuch