[
https://issues.apache.org/jira/browse/CONNECTORS-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13409076#comment-13409076
]
Karl Wright commented on CONNECTORS-489:
----------------------------------------
Mail from Rene:
We are now able to connect to the IIS proxy. Thanks to the logging
facilities Karl added, we were able to see that this is the fix:
{code}
Index: connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java
===================================================================
--- connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java (revision 1357379)
+++ connectors/webcrawler/connector/src/main/java/org/apache/manifoldcf/crawler/connectors/webcrawler/WebcrawlerConnector.java (working copy)
@@ -361,7 +361,7 @@
  String emailAddress = params.getParameter(WebcrawlerConfig.PARAMETER_EMAIL);
  if (emailAddress == null)
    throw new ManifoldCFException("Missing email address");
- userAgent = "ApacheManifoldCFWebCrawler; "+emailAddress+")";
+ userAgent = "Mozilla/5.0 (ApacheManifoldCFWebCrawler; "+emailAddress+")";
  from = emailAddress;
  x = params.getParameter(WebcrawlerConfig.PARAMETER_ROBOTSUSAGE);
{code}
Yes, this is weird: a proxy shouldn't fail on User-Agent settings, but
apparently this one does.
Even Googlebot apparently sends a Mozilla-style User-Agent:
http://www.useragentstring.com/pages/Googlebot/
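For illustration, the User-Agent format introduced by the patch can be sketched as a standalone helper. This is only a minimal sketch, not the connector's actual code: the class UserAgentSketch, the method buildUserAgent, and the example address are hypothetical, and only the resulting string format is taken from the patch.

```java
final class UserAgentSketch {
    // Hypothetical helper, not part of ManifoldCF: builds a browser-style
    // User-Agent value that carries the crawler name and a contact address
    // inside the parenthesized comment part.
    static String buildUserAgent(String emailAddress) {
        if (emailAddress == null || emailAddress.isEmpty()) {
            throw new IllegalArgumentException("Missing email address");
        }
        // The "Mozilla/5.0 (" prefix is what the patch adds.
        return "Mozilla/5.0 (ApacheManifoldCFWebCrawler; " + emailAddress + ")";
    }

    public static void main(String[] args) {
        // Example address is a placeholder.
        System.out.println(buildUserAgent("crawler-admin@example.com"));
    }
}
```

Keeping the crawler name and contact address inside the parenthesized part means site operators can still identify the crawler, while the leading Mozilla/5.0 token satisfies proxies that filter on it.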
> Some proxies restrict access based on User-Agent header
> -------------------------------------------------------
>
> Key: CONNECTORS-489
> URL: https://issues.apache.org/jira/browse/CONNECTORS-489
> Project: ManifoldCF
> Issue Type: Improvement
> Components: RSS connector, Web connector
> Affects Versions: ManifoldCF 0.6
> Reporter: Karl Wright
> Assignee: Karl Wright
> Fix For: ManifoldCF 0.6
>
>
> Some ISA proxies restrict access to content based on User-Agent. We need to
> have a user-agent header that doesn't fail on these sites.