Crawling password-protected sites requires two things:

1. Being able to submit data to the auth page via POST. Some sites accept the login in the query string, but most don't.
2. Being able to manage the session during the crawl, so that the server thinks the agent is still logged in as it goes from page to page.

I did this in an intelligent agent I wrote about 6 years ago, but I don't know enough about the Nutch agent to tell whether it is possible there.
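To make the two requirements concrete, here is a minimal sketch in Python using only the standard library. It is not Nutch code and not the agent I mentioned. It assumes the site uses a plain form login with cookie-based sessions, and the function names, URLs and form field names are hypothetical placeholders.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return (opener, jar). The opener stores cookies it receives in the
    jar and sends them back on every later request made through it."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def login(opener, login_url, fields):
    """Requirement 1: submit the credentials in a POST body, not in the
    query string. Returns the HTTP status code of the response."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    request = urllib.request.Request(login_url, data=body, method="POST")
    with opener.open(request) as response:
        return response.status


def fetch(opener, url):
    """Requirement 2: fetch a page through the same opener, so the session
    cookie set at login is sent along and the server sees a logged-in agent."""
    with opener.open(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Example usage (hypothetical intranet URL and form field names):
#   opener, jar = make_session()
#   login(opener, "https://intranet.example.com/login",
#         {"user": "alice", "password": "s3cret"})
#   html = fetch(opener, "https://intranet.example.com/docs/index.html")
```

The important part is that every page fetch during the crawl goes through the one opener that performed the login. A crawler that opens a fresh connection state for each URL loses the cookie and gets bounced back to the login page.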
-----Original Message-----
From: Mohini Padhye [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 02, 2006 4:26 PM
To: [email protected]
Subject: RE: https plugin for Nutch

Sameer,

Thanks for the reply. I could configure and use protocol-http plugin for crawling site that's using https protocol.

Also, has anyone worked with crawling password protected sites? My requirement is crawling an intranet site that uses https and user authentication. I searched through the forum but couldn't find anybody who has successfully implemented it. I'm also going through the source files for protocol-http plugin to see if any changes can be made there for my specific requirement.

Thanks,
Mohini

-----Original Message-----
From: Sameer Tamsekar [mailto:[EMAIL PROTECTED]
Sent: Wednesday, March 01, 2006 10:31 PM
To: [email protected]
Subject: Re: https plugin for Nutch

If you use protocol-httpclient (versus protocol-http) then it should support https.

I have got this reply from one of the mailing list user.

Regards,
Sameer

On 3/2/06, Mohini Padhye <[EMAIL PROTECTED]> wrote:
>
> I am using nutch-0.7.1. I wanted to know if anyone has successfully
> implemented https plugin for nutch.
> If not, can someone provide guidelines about developing it and I can
> start with the implementation?
> -Mohini
>

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
