AJ Chen wrote:
The default plugin.includes property in Nutch 0.7 does not include protocol-httpclient. After adding it, the crawler does recognize https URLs. Thanks. However, there are still two kinds of errors related to https.

(1) NoRouteToHostException. It occurs very often, for example:

050910 150336 fetching https://www.picoscript.com/products.aspx
050910 150336 fetch of https://www.picoscript.com/products.aspx failed with: java.lang.Exception: java.net.NoRouteToHostException: No route to host: connect


The JDK API has this to say: "Typically, the remote host cannot be reached because of an intervening firewall, or if an intermediate router is down." But perhaps this could also be a symptom of an overloaded DNS server (one that could not resolve the name to an IP address in time)...
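
One way to narrow this down is to separate the failure types before guessing at the cause. In the JDK, a name that cannot be resolved normally surfaces as java.net.UnknownHostException, so counting the two cases separately may help test the DNS hypothesis. The following is only a minimal sketch: the class FetchFailure and its classify method are hypothetical names, not part of Nutch, and it assumes the original network exception is still reachable through the cause chain of the logged java.lang.Exception.

```java
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;

// Hypothetical helper for illustration only; not part of Nutch.
class FetchFailure {

    // Walks the cause chain and returns a coarse label for the failure.
    static String classify(Throwable t) {
        for (Throwable c = t; c != null; c = c.getCause()) {
            if (c instanceof UnknownHostException) return "dns";          // name did not resolve
            if (c instanceof NoRouteToHostException) return "no-route";   // firewall or router problem
            if (c instanceof SocketTimeoutException) return "timeout";    // connect or read timed out
            if (c instanceof ConnectException) return "refused";          // nothing listening on the port
        }
        return "other";
    }

    public static void main(String[] args) {
        // Mimics the wrapped exception seen in the fetcher log.
        Exception wrapped =
            new Exception(new NoRouteToHostException("No route to host: connect"));
        System.out.println(classify(wrapped));
    }
}
```

If most failures end up labeled "no-route" rather than "dns", a firewall or routing problem on the crawling host's side seems the more likely explanation.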

(2) An https URL redirected from an http URL is not recognized. This also occurs very often, for example:

050910 150341 fetch of http://www.cellsciences.com/content/c2-contact.asp failed with: java.lang.Exception: org.apache.nutch.protocol.http.HttpException: Not an HTTP url:https://www.cellsciences.com/content/c2-contact.asp

Any idea what is happening?

Not yet... I need to test this scenario myself. Stay tuned...
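
For anyone who wants to reproduce the scenario in the meantime: the error message suggests that the redirect target was handed to code that only accepts http URLs. The sketch below shows the kind of check that would detect a scheme change on redirect, so that the target could be routed to a handler for the new scheme. The class RedirectScheme and its methods are hypothetical and only illustrate the idea; this is not how Nutch's fetcher is actually written.

```java
import java.net.URI;

// Hypothetical helper for illustration only; not part of Nutch.
class RedirectScheme {

    // Resolves a Location header value (absolute or relative) against the original URL.
    static URI resolve(String fromUrl, String location) {
        return URI.create(fromUrl).resolve(location);
    }

    // True when the redirect target uses a different scheme than the original URL,
    // for example http -> https, and so needs a different protocol handler.
    static boolean needsOtherProtocol(String fromUrl, URI target) {
        String fromScheme = URI.create(fromUrl).getScheme();
        String toScheme = target.getScheme();
        return toScheme != null && !toScheme.equalsIgnoreCase(fromScheme);
    }

    public static void main(String[] args) {
        String from = "http://www.cellsciences.com/content/c2-contact.asp";
        URI target = resolve(from, "https://www.cellsciences.com/content/c2-contact.asp");
        System.out.println(target.getScheme() + " " + needsOtherProtocol(from, target));
    }
}
```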

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers