Does anybody know how to set a character encoding
other than UTF-8, which seems to be the default in
Nutch 0.8.1 on Tomcat 5? (Ubuntu 6.10 / Tomcat 5.0)

What I have tried:

In <tomcat_root>/conf/web.xml
(in the jsp section),
added:
<init-param>
<param-name>javaEncoding</param-name>
<param-value>ISO-8859-1</param-value>
</init-param>

In <tomcat_root>/webapps/ROOT/WEB-INF/web.xml
(in the <servlet-name>Cached</servlet-name> section),
added:
<init-param>
<param-name>javaEncoding</param-name>
<param-value>ISO-8859-1</param-value>
</init-param>

Then I stopped and restarted Tomcat (from the crawldir
folder of Nutch).

The browser keeps showing UTF-8 encoded pages, and
French special characters are being replaced with
wrong characters.

I'm not a .jsp jock, but I believe the UTF-8 encoding is baked into the pages. See this search (http://krugle.com/kse/files?query=utf-8&lang=jsp&project=nutch), where you'll get a bunch of .jsp pages in Nutch that have the UTF-8 encoding in the HTML sections.
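
To check what a given page declares, something like the following could be used. This is only a rough sketch: the class name, the method name, and the sample page line are made up for illustration and are not taken from the Nutch sources. It simply looks for a `charset=` or `pageEncoding=` declaration in the text of a page.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetDeclarationFinder {
    // Matches charset=NAME or pageEncoding="NAME", ignoring case.
    private static final Pattern DECLARATION = Pattern.compile(
            "(?:charset|pageEncoding)\\s*=\\s*[\"']?([A-Za-z0-9_\\-]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the first declared encoding name, or null if none is found.
    static String findCharset(String pageSource) {
        Matcher m = DECLARATION.matcher(pageSource);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical page directive, not copied from any Nutch file.
        String sample = "<%@ page contentType=\"text/html; charset=UTF-8\" %>";
        System.out.println("Declared encoding: " + findCharset(sample));
    }
}
```

If a page declares its encoding this way, that declaration is what the browser is told, so a container-level setting alone may not change what the browser sees.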

But leaving that aside, in general UTF-8 is the safest encoding to use. If a browser is showing "wrong characters", and the browser is relatively new, then my guess would be that there was an encoding problem when the data was initially parsed. So it wound up in the Nutch segments/index with the wrong value.
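
As a rough illustration of how that kind of parse-time mismatch produces "wrong characters", here is a small standalone Java sketch. It is not Nutch code; the class and method names are invented for the example. It writes a French word as bytes in one encoding and reads the bytes back in another.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {
    // Encode text with one charset, then decode the same bytes with another.
    static String reinterpret(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        // The French word for "summer", written with escapes so the
        // source file's own encoding does not matter.
        String original = "\u00e9t\u00e9";

        // UTF-8 bytes read as ISO-8859-1: each accented letter turns into
        // two unrelated Latin-1 characters.
        String utf8AsLatin1 = reinterpret(original,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // ISO-8859-1 bytes read as UTF-8: the accented bytes are not valid
        // UTF-8, so the decoder substitutes replacement characters.
        String latin1AsUtf8 = reinterpret(original,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);

        System.out.println("UTF-8 read as ISO-8859-1: " + utf8AsLatin1);
        System.out.println("ISO-8859-1 read as UTF-8: " + latin1AsUtf8);
    }
}
```

Once text has been decoded with the wrong charset and stored, changing the encoding of the output page does not bring the original characters back, which is why the parse step is the first place I would look.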

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
