Hi Yoav,
        If the content is dynamic, presumably it is stored in a
database?  I was just thinking that it might be easier to use some
database utilities to index the information.

        Do you know how to use JMeter to record the requests that a web
browser makes?  The browser is configured to use the port JMeter listens
on as its proxy.  I know that the JMeter cookie manager can save the
cookies that are gathered as part of the request.
        I'm pretty sure that Nutch can use a proxy:
http://wiki.apache.org/nutch/SetupProxyForNutch
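
To make the crawler side concrete, here is a minimal Python sketch of a
client that sends its HTTP requests through a local proxy, which is roughly
what Nutch would be doing once its proxy settings point at JMeter.  The
address localhost:8080 and the function name are assumptions for
illustration; use whatever port the JMeter proxy is actually listening on.

```python
import urllib.request

# Assumed address of the local proxy; adjust to the real listening port.
PROXY_URL = "http://localhost:8080"


def build_proxied_opener(proxy_url=PROXY_URL):
    """Return an opener that routes plain-HTTP requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url})
    return urllib.request.build_opener(handler)


opener = build_proxied_opener()
# opener.open("http://...") would now be sent to the proxy, not the site.
```

Nothing is fetched until opener.open() is called, so this can be tried
without the proxy running.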

According to this page:
http://jakarta.apache.org/jmeter/usermanual/component_reference.html#HTTP_Cookie_Manager
you can manually add a cookie that will be used by all threads.  I am
guessing that if you set up JMeter to act as a proxy, the proxy thread
would be one of the threads that gets the cookie.

If the proxy thread cannot have cookies added manually, then this
strategy wouldn't work.
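
To illustrate what I mean by the proxy adding the cookie, here is a rough
Python sketch of a tiny forwarding proxy that attaches a fixed cookie to
every GET request it relays.  This is not JMeter and says nothing about how
JMeter works internally; the cookie value, the port, and all names are made
up, and it only handles plain HTTP GET.

```python
import http.server
import urllib.error
import urllib.request

SESSION_COOKIE = "JSESSIONID=replace-me"  # hypothetical; copy from a real login
LISTEN_PORT = 8080                        # assumed port for the crawler to use

# Headers that should not be forwarded as-is by this simple relay.
_DROPPED = {"connection", "proxy-connection", "keep-alive", "accept-encoding"}
# Opener that ignores system proxy settings, so the relay cannot loop to itself.
_direct = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def with_cookie(headers, cookie=SESSION_COOKIE):
    """Copy the request headers, appending cookie to any existing Cookie."""
    merged, existing = {}, None
    for name, value in headers.items():
        if name.lower() == "cookie":
            existing = value
        elif name.lower() not in _DROPPED:
            merged[name] = value
    merged["Cookie"] = f"{existing}; {cookie}" if existing else cookie
    return merged


class CookieInjectingProxy(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # A client configured with an HTTP proxy sends the absolute URL here.
        try:
            request = urllib.request.Request(
                self.path, headers=with_cookie(self.headers))
            with _direct.open(request, timeout=30) as upstream:
                status, reply, body = (
                    upstream.status, upstream.headers, upstream.read())
        except urllib.error.HTTPError as err:
            # Relay upstream error pages (404, 500, ...) unchanged.
            status, reply, body = err.code, err.headers, err.read()
        except (urllib.error.URLError, ValueError) as err:
            self.send_error(502, f"Upstream fetch failed: {err}")
            return
        self.send_response(status)
        self.send_header(
            "Content-Type",
            reply.get("Content-Type", "application/octet-stream"))
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def main():
    server = http.server.HTTPServer(
        ("127.0.0.1", LISTEN_PORT), CookieInjectingProxy)
    server.serve_forever()
```

Calling main() starts the relay; the crawler would then be pointed at
127.0.0.1 and the chosen port as its HTTP proxy.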

Patrick

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of
Yoav Shapira
Sent: Wednesday, October 01, 2008 11:47 AM
To: [email protected]
Subject: Re: How do I crawl a site with a cookie for authentication?

Patrick,
Thank you for the answers.  More below:

2008/10/1 Patrick Markiewicz <[EMAIL PROTECTED]>:
> Is it possible for you to retrieve a resource by using the url:
> http://username:[EMAIL PROTECTED]/path/to/resource.htm

The system does not support HTTP Basic authentication at this time,
unfortunately.

> I'm not sure what level of authority you have with the intranet site.
> You could do a similar trick by crawling the local filesystem of that
> site, and then just having the search page edit

The site is dynamically generated.  There are no meaningful static
files on the file system.

> If you only have your own account, and can't change any other things,
> then you might be able to use JMeter to add a cookie and have nutch use
> JMeter as a proxy.  I have never

This is very intriguing.  How would I get started on this?  I've used
JMeter in the past for simple test plans, but never as an HTTP proxy.

Yoav
