Good morning everyone, I've been spending a bit of time with Nutch lately - it's looking like a really solid product - but I've a couple of questions that I need to resolve before I can really say whether or not Nutch will work for my particular situation.
<Note - the rest of this email presumes some server-side knowledge.>

I've pointed Nutch at a relatively small J2EE-based intranet and am currently performing an intranet crawl, per the Nutch tutorial. As per the J2EE spec, the presentation tier uses the jsessionid token to maintain client state. Right now, my pages behave as expected for non-cookied clients (Nutch): they serialize the jsessionid onto each generated link (foo.jsp;jsessionid=XXXXXXXXX). This works, but the URLs that Nutch stores in the index contain the jsessionid token, which is a bit confusing and unnecessary.

What I'd like to see is Nutch obey the standard cookie model for any cookies returned by the requested domain. I realize this probably doesn't scale well for whole-web crawls, but it makes a lot of sense for small intranet crawls. Am I missing some command-line argument that will tell Nutch to play well with cookies?

Thanks for any assistance you can offer,
Cory Wilkerson
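
P.S. To make the URL form concrete, here is a minimal sketch of what stripping the token from a rewritten link could look like. It is plain Java using only java.util.regex and is not part of Nutch; the class name JsessionidStripper, the strip method, and the intranet.example host are all made up for illustration. It also assumes the token always appears as ";jsessionid=<value>" before any query string or fragment.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not a Nutch class: removes a servlet-style
// ";jsessionid=..." path parameter from a URL string.
public class JsessionidStripper {

    // Matches ";jsessionid=" plus the token, stopping before '?', '#',
    // or another ';' so the query string and fragment are left intact.
    private static final Pattern JSESSIONID =
            Pattern.compile(";jsessionid=[^?#;]*", Pattern.CASE_INSENSITIVE);

    public static String strip(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // Example input in the same shape as the links described above.
        System.out.println(
                strip("http://intranet.example/foo.jsp;jsessionid=ABC123?x=1"));
    }
}
```

This only cleans up the stored URL; it does not make the crawler keep session state the way real cookie handling would.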
