The default plugin.includes property in Nutch 0.7 does not include
protocol-httpclient. After adding it, crawling does recognize https
URLs. Thanks. However, there are still two kinds of errors related to
https.
(1) NoRouteToHostException. It occurs very often, for example:
050910 150336 fetching https://www.picoscript.com/products.aspx
050910 150336 fetch of https://www.picoscript.com/products.aspx failed with: java.lang.Exception: java.net.NoRouteToHostException: No route to host: connect
(2) An https URL that is redirected from an http URL is not recognized.
This also occurs very often, for example:
050910 150341 fetch of http://www.cellsciences.com/content/c2-contact.asp failed with: java.lang.Exception: org.apache.nutch.protocol.http.HttpException: Not an HTTP url:https://www.cellsciences.com/content/c2-contact.asp
Any idea what is happening?
-AJ
Andrzej Bialecki wrote:
AJ Chen wrote:
Andrzej, Thanks.
A related question: some of the sites I crawl use https: or redirect
to https:. The default Nutch settings do not recognize https: as a valid
URL scheme. Is there a way to crawl URLs starting with "https:"?
Which version of Nutch? 0.7 recognizes and supports https urls,
through the protocol-httpclient plugin.
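
For reference, the property change described at the start of this message might look roughly like the fragment below. This is only a sketch: it assumes the override goes in conf/nutch-site.xml, and every plugin name other than protocol-httpclient is a placeholder for whatever the existing value already lists. Listing protocol-httpclient in place of protocol-http is one possible arrangement, on the assumption that it handles both http and https. Whether that resolves the redirect error in (2) has not been tested here.

```xml
<!-- Sketch of an override in conf/nutch-site.xml (assumed location). -->
<property>
  <name>plugin.includes</name>
  <!-- Assumption: only protocol-httpclient comes from this thread; the
       remaining entries are placeholders, so keep your own existing list. -->
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
</property>
```

The value is a regular expression matched against plugin ids, so the entries are separated with "|" rather than commas.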