I don't know, Dan, but it's something on my list too. I rather doubt this is a built-in feature of Nutch, because it is generally thought of as a specialized intelligent agent (IA) capability rather than general spidering/indexing technology. It is certainly possible to do, but there are a few problems that need to be addressed to get any IA to do this deed for you.
First, HTTPS generally has nothing to do with this; it is just another protocol the agent has to support. I believe Nutch does support HTTPS, as indicated by several prior posts, but can someone confirm? The capability below could probably be implemented as a plug-in to Nutch, but some pretty substantial work would need to be done.

1. You have to be able to identify the login page for a set of content. In other words, you have to tell your IA: this is the page where I need to submit my login credentials to gain access. Your IA must also be able to look up your credentials from a database and submit them as name-value pairs to the server, via an HTTP POST in most scenarios.

2. You have to specify a page at which to begin crawling once your credentials have been accepted. You shouldn't rely on the post-login redirect as your start page, as the specific content you are after probably lives elsewhere.

3. Your IA must be able to manage the session with the web server. Most authentication schemes rely on a flag on the server indicating that you are logged in; if the IA does not resend the session cookie properly, the web server will think you have logged off.

Once you have the server's content indexed, there is a further wrinkle: when a user of your search engine clicks one of the links, she would have to submit her own credentials to gain access anyway. Automatically replaying the credentials originally used to access the content is possible, but it would probably raise the ire of the sysadmin, and it would also expose the credentials you used to fetch the content in the first place to the world.

I can honestly say Nutch is not the right solution for everything. If you are after indexing content behind a wall, it is probably best to use code better suited to the task, unless someone has already made a truly custom hack for Nutch in this area.
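To make steps 1 and 3 concrete, here is a minimal sketch in Java (Nutch's own language) of the two mechanical pieces involved: form-encoding the credential name-value pairs for the HTTP POST, and extracting the session cookie from the server's Set-Cookie response header so it can be resent on every later request. The field names, the login URL, and the class itself are hypothetical illustrations, not anything Nutch provides; it also assumes a Java runtime recent enough to have the Charset overload of URLEncoder.encode.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class LoginHelper {

    // Build an application/x-www-form-urlencoded request body from the
    // credential fields, as a browser would when submitting the login form.
    public static String formEncode(Map<String, String> fields) {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) body.append('&');
            body.append(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    // Pull the bare name=value pair out of a Set-Cookie header so it can be
    // replayed on every subsequent request; if the agent fails to resend it,
    // the server concludes the session ended and treats the crawler as
    // logged off (step 3 above).
    public static String sessionCookie(String setCookieHeader) {
        int semicolon = setCookieHeader.indexOf(';');
        return semicolon >= 0 ? setCookieHeader.substring(0, semicolon)
                              : setCookieHeader;
    }

    /*
     * The login fetch itself would then look roughly like this (sketch only,
     * not executed here; the URL is made up):
     *
     *   HttpURLConnection conn = (HttpURLConnection)
     *       new URL("https://example.com/login").openConnection();
     *   conn.setRequestMethod("POST");
     *   conn.setDoOutput(true);
     *   conn.getOutputStream().write(
     *       formEncode(credentials).getBytes(StandardCharsets.UTF_8));
     *   String cookie = sessionCookie(conn.getHeaderField("Set-Cookie"));
     *   // ...then send "Cookie: " + cookie with every later fetch.
     */
}
```

Note that formEncode percent-escapes reserved characters in the credentials themselves, which matters the moment a password contains '&' or '='.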
I recently dusted off a book I have entitled Programming Bots, Spiders, and Intelligent Agents in Microsoft Visual C++, by David Pallmann, published way back in 1999. Surprisingly, not much has changed since then. It would be a good read for anyone aspiring to learn more about the topic.

-----Original Message-----
From: Dan Fundatureanu [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 09, 2006 12:03 PM
To: [email protected]
Subject: Indexing a web site over HTTPS using username/passwd

Hi,

Could you point me to where I can find some info about how I can use Nutch to crawl a website where access is provided only via HTTPS using a username/password? Are there any config settings that I have to make, or do I have to hack the code to change this?

Thanks,
Dan Fundatureanu

_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
