Nutch falls back to the default LANG set on your machine if it cannot identify the document encoding. The only way I could resolve this was to update the default LANG in the /etc/sysconfig/i18n file on every machine in the Hadoop cluster. Setting it with export LANG=... did not work either.
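To check what a given machine actually falls back to, something like the following may help. It is only a minimal sketch, and it assumes the fallback goes through the JVM default charset, which older JVMs derive from the locale (LANG); that link is my assumption, not something confirmed in this thread. The class name DefaultCharsetCheck and the helper roundTrip are made up for illustration and are not part of Nutch. Note that JDK 18 and later default to UTF-8 regardless of LANG, so this is mainly relevant for older Java versions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Nutch.
public class DefaultCharsetCheck {

    // Encode the text with the given charset and decode it again, mimicking
    // text that passes through a component using that charset.
    // String.getBytes(Charset) replaces unmappable characters with the
    // charset's replacement bytes, which is '?' for US-ASCII.
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Umlauts written as escapes so the source file encoding does not matter.
        String sample = "unabh\u00e4ngige Branchenexperten pr\u00fcfen";

        // What the environment and the JVM report on this machine.
        System.out.println("LANG            = " + System.getenv("LANG"));
        System.out.println("default charset = " + Charset.defaultCharset());

        // A charset without umlauts loses them; UTF-8 keeps them.
        System.out.println("via US-ASCII    = " + roundTrip(sample, StandardCharsets.US_ASCII));
        System.out.println("via UTF-8       = " + roundTrip(sample, StandardCharsets.UTF_8));
    }
}
```

Running this on each node of the cluster (and as the same user that runs the Hadoop and Tomcat processes) should show whether any of them reports a non-UTF-8 default charset. The "via UTF-8" line can itself show '?' if the terminal or stdout encoding is not UTF-8, so the "default charset" line is the more reliable indicator.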
-----Original Message-----
From: Mathias Conradt [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 29 April 2008 9:10
To: [email protected]
Subject: Problems with encoding (UTF-8), display of search results with special characters

I already searched the mailing list for this issue and tried all the suggestions, but didn't find a solution.

I have a German website. The site is encoded in UTF-8 and is displayed properly in the browser, which detects the correct encoding and renders all pages correctly. (I use Nutch 0.9 on Gentoo Linux, with JBoss and embedded Tomcat 5.5.)

But Nutch doesn't display the search results properly, or doesn't even index the special characters properly: it displays a '?' instead of German umlauts (ä, ü, ö, ...), so the output looks something like:

    ...unabh?ngige Branchenexperten pr?fen....

So far I have:

1) Set the meta data correctly as follows:

    <meta http-equiv="content-type" content="text/html; charset=utf-8" />
    <%@ page contentType="text/html; charset=utf-8" pageEncoding="utf-8" language="java" ...

2) In nutch-site.xml, set:

    <property>
      <name>parser.character.encoding.default</name>
      <value>utf-8</value>
    </property>

I use JBoss with embedded Tomcat:

3) In web.xml, added a parameter:

    <init-param>
      <param-name>javaEncoding</param-name>
      <param-value>UTF-8</param-value>
    </init-param>

4) In server.xml, added URIEncoding="UTF-8" to the Context.

5) In the JSP page for the search results, set:

    request.setCharacterEncoding("UTF-8");

I still run into the problem described above, and all special characters are displayed as '?'. The same happens when I run the search from the shell via

    bin/nutch org.apache.nutch.searcher.NutchBean searchTerm

where the special characters are also displayed as '?'.

Has anyone run into the same problem and has any idea?

Thanks.

--
View this message in context: http://www.nabble.com/Problems-with-encoding-%28UTF-8%29%2C-display-of-search-results-with-special-characters-tp16954447p16954447.html
Sent from the Nutch - User mailing list archive at Nabble.com.
