Another way to crawl a password-protected site is to modify your
intranet site so that the Nutch bot can crawl it without
authentication. Since this is your own intranet site, this should be
simple. You may also want to validate the crawler machine's IP
address before allowing the Nutch bot to crawl unauthenticated.
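
As a rough illustration of the IP check, here is a minimal sketch in Python. It is not tied to any particular web framework, and the function name, the constant, and the 10.0.0.42 address are all made up for the example; you would wire the check into whatever authentication hook your intranet application already has.

```python
import ipaddress

# Hypothetical value: replace with the crawler machine's real address(es).
CRAWLER_NETWORKS = [ipaddress.ip_network("10.0.0.42/32")]


def may_skip_login(remote_addr, networks=CRAWLER_NETWORKS):
    """Return True only if the request comes from an allowed crawler address."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Malformed address: fall back to normal authentication.
        return False
    return any(addr in net for net in networks)
```

The check deliberately ignores the User-Agent header, since any client can claim to be the Nutch bot; the source address is the part that is harder to fake on an intranet.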

- Ravi Chintakunta


On 3/2/06, Richard Braman <[EMAIL PROTECTED]> wrote:
> Crawling password protected sites would require two things:
>
> 1. Being able to submit data to the auth page via POST. Some sites
> accept the login in the query string, but most don't.
> 2. Being able to manage the session during the crawl, so that the server
> thinks the agent is still logged in as it goes from page to page. I
> did this in an intelligent agent I wrote about 6 years ago, but I don't
> know enough about the Nutch agent to tell if it is possible.
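
To make the two requirements above concrete, here is a minimal sketch using only the Python standard library. It is not Nutch code and says nothing about what the Nutch fetcher can do; the function names, the URL, and the form field names are assumptions for illustration. The login is sent as a POST body, and a cookie jar keeps the session alive across the following page fetches.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_login_request(login_url, fields):
    """Build a POST request carrying the form fields in the body, not the query string."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        login_url,
        data=body,  # supplying data makes urllib send a POST
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def build_session_opener():
    """Return an opener that stores cookies and replays them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def crawl_with_login(login_url, fields, page_urls):
    """Log in once, then fetch each page through the same cookie-aware opener."""
    opener, _jar = build_session_opener()
    opener.open(build_login_request(login_url, fields)).close()
    pages = {}
    for url in page_urls:
        with opener.open(url) as response:
            pages[url] = response.read()
    return pages
```

A call would look like crawl_with_login("https://intranet.example/login", {"username": "crawler", "password": "secret"}, [...]), where the URL and field names are placeholders for whatever your login form actually expects.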
>
> -----Original Message-----
> From: Mohini Padhye [mailto:[EMAIL PROTECTED]
> Sent: Thursday, March 02, 2006 4:26 PM
> To: [email protected]
> Subject: RE: https plugin for Nutch
>
>
> Sameer,
> Thanks for the reply. I was able to configure and use the protocol-http
> plugin to crawl a site that uses the https protocol. Also, has anyone
> worked with crawling password-protected sites? My requirement is to
> crawl an intranet site that uses https and user authentication. I
> searched through the forum but couldn't find anybody who has
> successfully implemented it. I'm also going through the source files
> for the protocol-http plugin to see if any changes can be made there
> for my specific requirement.
>
> Thanks,
> Mohini
>
>
> -----Original Message-----
> From: Sameer Tamsekar [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, March 01, 2006 10:31 PM
> To: [email protected]
> Subject: Re: https plugin for Nutch
>
> If you use protocol-httpclient (versus protocol-http) then it should
> support https.
>
> I got this reply from one of the mailing list users.
>
> Regards,
>
> Sameer
>
> On 3/2/06, Mohini Padhye <[EMAIL PROTECTED]> wrote:
> >
> > I am using nutch-0.7.1. I wanted to know if anyone has successfully
> > implemented an https plugin for Nutch.
> > If not, can someone provide guidelines for developing it so that I
> > can start on the implementation?
> > -Mohini
> >
> >
>
>


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
