Wilkerson, Cory wrote:
Good morning everyone,
I've been spending a bit of time with Nutch lately - it's looking like a
really solid product - but I've a couple of questions that I need to
resolve before I can really say whether or not Nutch will work for my
particular situation.
<Note - the rest of this email presumes some server-side
knowledge.>
I've pointed Nutch at a relatively small J2EE-based intranet and am
currently performing an intranet crawl, per the Nutch tutorial. As per
the J2EE spec, the presentation tier utilizes the jsessionid token to
maintain client state. Right now, I'm seeing my pages perform
accordingly to non-cookied clients (Nutch) and serialize the jsessionid
^^^^^^^^^^^^^^^^^^^
Recent development versions of Nutch use the protocol-httpclient plugin to
handle HTTP, and this plugin supports cookies. Which version are you using?
onto the generated link (foo.jsp;jsessionid=XXXXXXXXX), and while this
works, the urls that Nutch stores in the index contain the jsessionid
token (yes, it works, but it's a bit confusing and unnecessary).
The jsessionid token can be removed from stored URLs with a regular
expression rule in conf/regex-normalizer.xml.
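
To illustrate the kind of expression meant here, below is a minimal sketch in
Python rather than in Nutch's own configuration syntax. The pattern and the
helper name strip_jsessionid are assumptions made for illustration, not
something shipped with Nutch; the idea is to delete ";jsessionid=..." up to
the next "?" or "#" (or the end of the URL) and leave the rest untouched.

```python
import re

# Hypothetical pattern (an assumption, not Nutch's shipped rule):
# match ";jsessionid=<token>" up to the next '?', '#', or end of the URL.
JSESSIONID = re.compile(r";jsessionid=[^?#]*", re.IGNORECASE)


def strip_jsessionid(url: str) -> str:
    """Return the URL with any ';jsessionid=...' path parameter removed."""
    return JSESSIONID.sub("", url)


if __name__ == "__main__":
    # Example URLs are made up for illustration.
    for u in (
        "http://intranet/foo.jsp;jsessionid=XXXXXXXXX",
        "http://intranet/foo.jsp;jsessionid=XXXXXXXXX?page=2",
        "http://intranet/bar.jsp",
    ):
        print(strip_jsessionid(u))
```

An equivalent pattern, paired with an empty substitution, should carry over to
a rule in conf/regex-normalizer.xml, but check it against the regex syntax of
the Nutch version you are running before relying on it.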
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com