[ http://issues.apache.org/jira/browse/NUTCH-28?page=comments#action_62131 ]
Doug Bakewell commented on NUTCH-28:
------------------------------------
I have tried the attached code. It worked fine for simple https pages. We also
have a few https pages that redirect to a login page on a different host.
Those pages produced the exception below, but crawling continued and the
resulting database returned the expected results, without the login page and
without the redirecting page.
I'm not sure this is a real problem. The exception may only need to be caught
somewhere so that the output is suppressed. I'm also not sure whether the
first line of the log below is related.
050401 021440 Going to buffer response body of large or unknown size. Using getResponseAsStream instead is recommended.
050401 021440 Error getting URI host
org.apache.commons.httpclient.HttpException: Redirect from host demo.nfis.org to ca.nfis.org is not supported
    at org.apache.commons.httpclient.HttpMethodBase.checkValidRedirect(HttpMethodBase.java:1237)
    at org.apache.commons.httpclient.HttpMethodBase.processRedirectResponse(HttpMethodBase.java:1185)
    at org.apache.commons.httpclient.HttpMethodBase.isRetryNeeded(HttpMethodBase.java:967)
    at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:1089)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:643)
    at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:497)
    at net.nutch.protocol.https.HTTPS.getContent(HTTPS.java:22)
    at net.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:107)
050401 021440 Invalid Redirect URI from: https://demo.nfis.org:443/mapserver/nai.phtml to: https://ca.nfis.org/access/login.jsp?DACS_ERROR_CODE=902&DACS_VERSION=1.2&DACS_FEDERATION=nfis.org&DACS_JURISDICTION=DEMO&DACS_HOSTNAME=demo.nfis.org&DACS_USER_AGENT=Jakarta%20Commons-HttpClient%2f2.0.2&DACS_ERROR_URL=https://demo.nfis.org:443/mapserver/nai.phtml
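To illustrate the kind of check that appears to be failing in the trace above, here is a minimal sketch in plain Java. It does not use commons-httpclient, and the class and method names (RedirectCheck, redirectChange, effectivePort) are hypothetical, not part of Nutch or HttpClient. The log only confirms that a change of host is rejected; treating a change of protocol or port the same way is an assumption made here. A fetcher could run a check like this on the Location header and skip or log the redirect quietly instead of printing a stack trace.

```java
import java.net.URI;

public class RedirectCheck {

    /** Port named in the URI, or the scheme default (443 for https, else 80). */
    static int effectivePort(URI uri) {
        if (uri.getPort() != -1) {
            return uri.getPort();
        }
        return "https".equalsIgnoreCase(uri.getScheme()) ? 443 : 80;
    }

    /** Null-safe, case-insensitive comparison for schemes and host names. */
    static boolean sameIgnoreCase(String a, String b) {
        return a == null ? b == null : a.equalsIgnoreCase(b);
    }

    /**
     * Reports which part of the server address a redirect changes:
     * "protocol", "host", "port", or "none" when it stays on the same server.
     */
    static String redirectChange(String from, String location) {
        URI source = URI.create(from);
        // A Location header may be relative, so resolve it against the source.
        URI target = source.resolve(URI.create(location));

        if (!sameIgnoreCase(source.getScheme(), target.getScheme())) {
            return "protocol";
        }
        if (!sameIgnoreCase(source.getHost(), target.getHost())) {
            return "host";
        }
        if (effectivePort(source) != effectivePort(target)) {
            return "port";
        }
        return "none";
    }

    public static void main(String[] args) {
        // URLs taken from the log above, with the query string shortened.
        String from = "https://demo.nfis.org:443/mapserver/nai.phtml";
        String to = "https://ca.nfis.org/access/login.jsp";
        System.out.println("Redirect changes: " + redirectChange(from, to));
    }
}
```

With the two URLs from the log, the check reports a host change, which matches the "Redirect from host demo.nfis.org to ca.nfis.org is not supported" message.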
> No support for https
> --------------------
>
> Key: NUTCH-28
> URL: http://issues.apache.org/jira/browse/NUTCH-28
> Project: Nutch
> Type: Improvement
> Reporter: Stefan Groschupf
> Attachments: protocol-https.tgz
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=986240&group_id=59548&atid=491356
> submitted by:
> Konstantin Ignatyev
> The crawl tool does not support the https protocol.
> I have created a very simple implementation based on
> commons-httpclient and attached it to the report. It
> seems to work, although it requires commons-httpclient.jar
> and commons-logging.jar in the lib directory.