Is it possible for you to retrieve a resource by using a URL of the form http://username:[EMAIL PROTECTED]/path/to/resource.htm ?

If that works, you could temporarily give a "nutchuser" an account on the site (with as little permission as possible), then crawl the intranet site, and disable the account. Then edit the nutch search page to strip out the "nutchusername:nutchpassword@" part of each URL when you present results to the user. That way, only the users who previously authenticated would have access to that resource.

I'm not sure what level of authority you have with the intranet site. You could do a similar trick by crawling the local filesystem of that site, and then just having the search page edit each URL to replace the file system path with a URL path that would work for a logged in user.

If you only have your own account, and can't change any other things, then you might be able to use JMeter to add a cookie and have nutch use JMeter as a proxy. I have never done this, so I don't actually remember if JMeter can add a cookie to a request being made by an application that it proxies.

-----Original Message-----
From: Doğacan Güney [mailto:[EMAIL PROTECTED]
Sent: Wednesday, October 01, 2008 10:08 AM
To: [email protected]
Subject: Re: How do I crawl a site with a cookie for authentication?

On Wed, Oct 1, 2008 at 4:35 PM, Yoav Shapira <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I would like to use Nutch to crawl and index an intranet web site for
> internal use.  The site requires authentication, and stores the
> credentials in a cookie.  I've got a valid login and I have the cookie
> saved, no problem.  How do I tell Nutch to use it?
>
> I did some research online before asking, but unfortunately I couldn't
> find a step-by-step answer for a newbie like myself.  I see there's an
> http-client plugin that can support some authentication.  Is that what
> I should use for cookies?  If so, how do I configure it?
>
> Or is there something else I should be doing?  If the documentation /
> answer exists, sorry for the hassle and please just point me to it ;)
>

Unfortunately, nutch doesn't have such a feature yet. (One of the problems is that we do not have a place to store cookies in a distributed setup)

> --
> Thanks,
>
> Yoav
>

--
Doğacan Güney
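
For illustration, the two URL-rewriting tricks suggested in the reply at the top (stripping the "nutchusername:nutchpassword@" part, and replacing a file system path with a URL path) could look roughly like the sketch below. This is not Nutch code: it is plain Python using only the standard library, and the function names, the host intranet.example.com, and the document root /var/www/intranet are all made-up placeholders. The same logic would have to be ported to whatever the search page is actually written in.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def strip_userinfo(url: str) -> str:
    """Return the URL without any 'user:password@' prefix in its host part."""
    parts = urlsplit(url)
    # netloc looks like 'user:password@host:port'; keep what follows the last '@'.
    # If there is no '@', rpartition returns the whole netloc unchanged.
    host = parts.netloc.rpartition("@")[2]
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


def file_path_to_url(path: str, doc_root: str, base_url: str) -> str:
    """Map a crawled file system path under doc_root to a URL under base_url."""
    root = doc_root.rstrip("/") + "/"
    if not path.startswith(root):
        raise ValueError(f"{path!r} is not under {doc_root!r}")
    relative = path[len(root):]
    # Percent-encode characters such as spaces, but keep '/' as a separator.
    return base_url.rstrip("/") + "/" + quote(relative)


if __name__ == "__main__":
    # Hypothetical crawl-time URL and file path, as they might appear in the index.
    print(strip_userinfo(
        "http://nutchusername:nutchpassword@intranet.example.com/path/to/resource.htm"))
    print(file_path_to_url(
        "/var/www/intranet/path/to/resource.htm",
        "/var/www/intranet",
        "http://intranet.example.com"))
```

Either way, the link shown to the user carries no credentials, so the intranet site itself still decides whether the person clicking it is allowed to see the page.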
