Good morning everyone,

I've been spending a bit of time with Nutch lately - it's looking like a
really solid product - but I've a couple of questions that I need to
resolve before I can really say whether or not Nutch will work for my
particular situation.

<Note - the rest of this email presumes some server-side knowledge.>

I've pointed Nutch at a relatively small J2EE-based intranet and am
currently performing an intranet crawl, per the Nutch tutorial.  As per
the J2EE spec, the presentation tier uses the jsessionid token to
maintain client state.  Right now, my pages behave as expected for
clients that don't accept cookies (such as Nutch): they rewrite the
jsessionid into each generated link (foo.jsp;jsessionid=XXXXXXXXX).  The
crawl works, but the URLs that Nutch stores in the index contain the
jsessionid token, which is a bit confusing and unnecessary.
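
To make the problem concrete, here is a minimal sketch in plain Java of
the cleanup I'd otherwise have to apply to every stored URL.  This is
not Nutch code: the class name SessionIdStripper, the method strip, and
the host intranet.example are all made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: removes the ";jsessionid=..."
// path parameter that a servlet container appends for cookie-less clients.
final class SessionIdStripper {

    // Matches ";jsessionid=<token>", case-insensitively, up to the next
    // '?', '#' or ';' (or the end of the URL).
    private static final Pattern JSESSIONID =
            Pattern.compile(";jsessionid=[^?#;]*", Pattern.CASE_INSENSITIVE);

    static String strip(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        // Assumed example URL; prints the URL without the session token.
        System.out.println(
                strip("http://intranet.example/foo.jsp;jsessionid=ABC123?x=1"));
    }
}
```

Stripping the token after the fact only tidies the index; the crawler
would still be handed a fresh session id on every cookie-less request.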

What I'd like to see is Nutch obey the standard cookie model for any
cookies returned by the requested domain.  I realize this probably
doesn't scale well for whole-web crawls, but it makes a lot of sense
for small intranet crawls.
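
By "standard cookie model" I mean roughly the behaviour sketched below,
shown with the JDK's own java.net.CookieManager rather than anything in
Nutch.  The class name CookieRoundTrip, the method cookieHeaderFor, and
the example host and cookie value are assumptions for illustration
only; no network access is involved.

```java
import java.io.IOException;
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Nutch code: store a cookie from one response
// and compute the Cookie header a client would send on a later request.
final class CookieRoundTrip {

    static String cookieHeaderFor(String setCookie, URI first, URI next)
            throws IOException {
        // Only accept cookies from the server that set them.
        CookieManager manager =
                new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);

        // Simulate the response headers of the first request.
        Map<String, List<String>> response = Collections.singletonMap(
                "Set-Cookie", Collections.singletonList(setCookie));
        manager.put(first, response);

        // Ask which cookies apply to the follow-up request.
        Map<String, List<String>> request = manager.get(
                next, Collections.<String, List<String>>emptyMap());
        List<String> cookies = request.get("Cookie");
        return (cookies == null || cookies.isEmpty())
                ? "" : String.join("; ", cookies);
    }

    public static void main(String[] args) throws IOException {
        URI first = URI.create("http://intranet.example/foo.jsp");
        URI next = URI.create("http://intranet.example/bar.jsp");
        System.out.println(
                cookieHeaderFor("JSESSIONID=ABC123; Path=/", first, next));
    }
}
```

A client that replays the session cookie this way gives the container
no reason to rewrite the jsessionid into the links it generates.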

Am I missing some command-line argument that will tell Nutch to play
well with cookies?

Thanks for any assistance you can offer,
Cory Wilkerson


_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
