Hi,
I have some problems searching pages that contain international characters. It seems to me that Nutch, when crawling, silently ignores the character encodings declared in HTTP headers and META tags, and probably assumes the default platform encoding. This Is Bad, as it leads to mixed-codepage characters in the index and to mixed-encoding cached content. It's almost impossible to search or display such mixed content consistently. E.g. for Polish there are at least two or three popular encodings other than ISO-8859-1, and the text becomes garbled when the code assumes ISO-8859-1.
I tested it at Mozdex as well. It appears to me that Byron added something or other, because at least some of the sites in ISO-8859-2 appear to work correctly... But try the following query:
http://www.mozdex.com/search.jsp?query=polski+sejm&hitsPerPage=10
One of the top hits should be this: BIP (http://www.bip.gov.pl/). One of the words in the snippet is "Rządowa" (governmental). Naturally, I should be able to find this by entering this word as a query, right? Nope. The following query:
http://www.mozdex.com/search.jsp?query=rz%26%23261%3Bdowa&hitsPerPage=10
returns no hits. The reason is clear when you look at the cached page (http://www.mozdex.com/cached.jsp?idx=0&id=3220188): "Rządowa" becomes "Rz±dowa", which is the result of displaying Latin2 characters using the Latin1 codepage.
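To make the effect concrete, here is a minimal sketch of what happens to that one word. The class and method names (`Latin2Demo`, `encode`, `decode`) are made up for illustration and are not part of Nutch; the only assumption is that the JVM ships the ISO-8859-2 charset, which standard JDKs do.

```java
import java.nio.charset.Charset;

// Hypothetical helper, for illustration only (not Nutch code).
final class Latin2Demo {
    // Encode text into bytes using the named charset.
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    // Decode bytes into text using the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }
}
```

Encoding "Rz\u0105dowa" (a with ogonek) as ISO-8859-2 puts the single byte 0xB1 in the third position. Decoding those same bytes as ISO-8859-1 turns that byte into the plus-minus sign, so the indexed term no longer matches the query, while decoding as ISO-8859-2 gives back the original word.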
I hacked my way around at least some of these issues by adding the attached patch, and using UTF-8 systematically in every place that outputs the content. However, this doesn't take into account the META tag, or missing "charset=" headers. Any suggestions?
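For discussion, here is a rough sketch of how the two missing cases might be covered: take the charset from the Content-Type header if present, otherwise look for a META declaration near the top of the page, otherwise fall back to ISO-8859-1. Everything here (`CharsetSniffer`, its methods, the 2048-byte sniff limit, the regular expressions) is my own assumption for illustration, not existing Nutch code, and it uses newer JDK APIs than the patch does.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not part of Nutch.
final class CharsetSniffer {
    // Matches charset=xxx, tolerating optional quotes and trailing parameters.
    private static final Pattern CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([^\\s\"';>/]+)", Pattern.CASE_INSENSITIVE);
    // Same, but only inside a <meta ...> tag.
    private static final Pattern META = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([^\\s\"';>/]+)", Pattern.CASE_INSENSITIVE);
    // Assumed limit: only the first bytes of the page are scanned for META.
    private static final int SNIFF_LIMIT = 2048;

    // Charset name from a Content-Type header value, or null if absent.
    static String fromContentType(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Charset name from a META tag, or null. The head of the page is decoded
    // as ISO-8859-1, which maps every byte and keeps ASCII markup intact.
    static String fromMeta(byte[] content) {
        int len = Math.min(content.length, SNIFF_LIMIT);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    // Header first, then META, then the fallback. Unknown or malformed
    // charset names also yield the fallback.
    static Charset resolve(String contentType, byte[] content, Charset fallback) {
        String name = fromContentType(contentType);
        if (name == null) name = fromMeta(content);
        if (name == null) return fallback;
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            return fallback;
        }
    }

    // Re-encode content as UTF-8; content already in UTF-8 is returned as-is.
    static byte[] toUtf8(String contentType, byte[] content) {
        Charset cs = resolve(contentType, content, StandardCharsets.ISO_8859_1);
        if (cs.equals(StandardCharsets.UTF_8)) return content;
        return new String(content, cs).getBytes(StandardCharsets.UTF_8);
    }
}
```

Whether the header should win over the META tag when they disagree is a judgment call; I put the header first here only because that is what the patch already reads.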
--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)
? patch.diff
Index: HttpResponse.java
===================================================================
RCS file: /cvsroot/nutch/nutch/src/java/net/nutch/net/protocols/http/HttpResponse.java,v
retrieving revision 1.3
diff -b -d -u -r1.3 HttpResponse.java
--- HttpResponse.java 29 Apr 2004 20:36:12 -0000 1.3
+++ HttpResponse.java 26 May 2004 17:09:44 -0000
@@ -215,7 +215,25 @@
if (socket != null)
socket.close();
}
-
+ // try to convert to UTF-8
+ String contentType = getHeader("Content-Type");
+ // no content type or not text - give up
+ if (contentType == null || !contentType.startsWith("text/")) return;
+ contentType = contentType.toLowerCase();
+ int idx = contentType.indexOf("charset=");
+ String encoding = null;
+ if (idx == -1) {
+ // unspecified encoding... assume the most common
+ encoding = "ISO-8859-1";
+ } else encoding = contentType.substring(idx + 8);
+ if (encoding.equals("utf-8")) return;
+ try {
+ String newContent = new String(content, encoding);
+ content = newContent.getBytes("UTF-8");
+ } catch (Exception e) {
+ // *shrug* ignore
+ //e.printStackTrace();
+ }
}
private void readPlainContent(InputStream in)
