Hi Alex,

If it's not fetching https URLs, you can try adding this https line to your crawl-urlfilter.txt file:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*(DOMAIN1|DOMAIN2)/
+^https://([a-z0-9]*\.)*(DOMAIN1|DOMAIN2)/

After adding this line, it will fetch all the https URLs. But I am still getting these exceptions for the https URLs:

javax.net.ssl.SSLException: Unrecognized SSL message, plaintext connection?
org.apache.nutch.protocol.http.api.HttpException: java.net.UnknownHostException: secure.americanexpress.com

Vimal Varghese

Koch Martina <[email protected]>
21-01-09 04:05 PM
Please respond to [email protected]
To: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
cc:
Subject: AW: fetching https documents

Hi Alex,

https pages can be fetched with the protocol-httpclient plugin.

Kind regards,
Martina

-----Original Message-----
From: Alex Basa [mailto:[email protected]]
Sent: Wednesday, 21 January 2009 00:41
To: [email protected]
Subject: fetching https documents

I searched for patches and couldn't find one. Does anyone know if nutch 0.9 supports crawling https websites? If so, can someone point me to the patch?

Thanks in advance,
Alex
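
For anyone who wants to sanity-check the two accept rules quoted in the first reply before re-running a crawl, here is a minimal sketch in Python. It is not Nutch code: Nutch evaluates these rules with Java regular expressions, and this sketch assumes (as I understand the regex URL filter) that the first rule whose pattern is found in the URL decides, and that a URL matching no rule is dropped. The domains `example.com` and `example.org` and the helper names `build_rules` and `accepts` are hypothetical stand-ins, not anything from the thread.

```python
import re

# Hypothetical stand-ins for DOMAIN1 and DOMAIN2 in the filter rules.
DOMAINS = ["example.com", "example.org"]


def build_rules(domains):
    """Return (sign, compiled_pattern) pairs mirroring the two '+' rules."""
    alternation = "|".join(re.escape(d) for d in domains)
    # Same shape as the rule in the email: optional subdomain labels,
    # then one of the allowed domains, then a mandatory slash.
    host = r"([a-z0-9]*\.)*(" + alternation + ")/"
    return [
        ("+", re.compile(r"^http://" + host)),
        ("+", re.compile(r"^https://" + host)),
    ]


def accepts(url, rules):
    """Assumed semantics: first matching rule decides; no match rejects."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False


if __name__ == "__main__":
    rules = build_rules(DOMAINS)
    for url in (
        "http://www.example.com/index.html",
        "https://secure.example.com/login",
        "https://example.net/",
    ):
        print(url, accepts(url, rules))
```

Note that the pattern requires a slash after the host name and that `[a-z0-9]*` does not allow hyphens in subdomain labels, so URLs of either kind would be filtered out under these rules. Passing the filter only gets a URL queued; the SSLException and UnknownHostException above happen later, at fetch time.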
