[Nutch-dev] Content-Type and character encoding

Andrzej Bialecki Wed, 26 May 2004 11:44:35 -0700

Hi,

I have some problems searching pages that contain international characters. It seems to me that Nutch, when crawling, silently ignores the character encodings declared in headers / META tags, and probably assumes the default platform encoding. This Is Bad, as it leads to characters from mixed codepages in the index, and to cached content in mixed encodings. It's almost impossible to search or display such mixed content consistently. E.g. for Polish there are at least two or three popular encodings other than ISO-8859-1, and the text becomes garbled when the code assumes ISO-8859-1.

I tested it at Mozdex as well - it appears to me that Byron added something or other, because at least part of the sites in ISO-8859-2 appear to work correctly... But try the following query:

http://www.mozdex.com/search.jsp?query=polski+sejm&hitsPerPage=10

One of the top hits should be this: BIP (http://www.bip.gov.pl/). One of the words in the snippet is "Rządowa" (governmental). Naturally, I should be able to find this by entering this word as a query, right? Nope. The following query:

http://www.mozdex.com/search.jsp?query=rz%26%23261%3Bdowa&hitsPerPage=10

returns no hits. The reason is clear when you look at the cached page (http://www.mozdex.com/cached.jsp?idx=0&id=3220188): "Rządowa" becomes "RzÃdowa", which is the result of displaying Latin2 characters using the Latin1 codepage.
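
To make the mechanism concrete, here is a minimal sketch (not Nutch code; the class and method names are made up for illustration) that encodes the word as ISO-8859-2 and then decodes the same bytes as ISO-8859-1. It assumes the JVM ships the ISO-8859-2 charset, which is not one of the charsets Java guarantees. The exact garbage you see on a real page depends on the encodings involved, so treat this as an illustration of the mechanism rather than a reproduction of the cached page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode text in its real charset, then decode the same bytes with a
    // different one, which is what happens when a declared encoding is ignored.
    static String misdecode(String text, Charset actual, Charset assumed) {
        byte[] raw = text.getBytes(actual);
        return new String(raw, assumed);
    }

    public static void main(String[] args) {
        // Assumption: ISO-8859-2 is available in this JVM.
        Charset latin2 = Charset.forName("ISO-8859-2");
        String word = "Rz\u0105dowa"; // "Rządowa", written with an escape for a-ogonek

        String garbled = misdecode(word, latin2, StandardCharsets.ISO_8859_1);

        // In ISO-8859-2 the a-ogonek is byte 0xB1; ISO-8859-1 maps 0xB1 to U+00B1,
        // so the third character is no longer the letter that was indexed.
        System.out.println(garbled);
        System.out.println("same as original: " + garbled.equals(word));
    }
}
```

A term stored in the index in this garbled form can never match a query for the correctly encoded word, which is consistent with the empty result above.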

I hacked my way around at least some of these issues by adding the attached patch, and using UTF-8 systematically in every place that outputs the content. However, this doesn't take into account the META tag, or missing "charset=" headers. Any suggestions?
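
One possible direction, as a rough sketch only: look at the "charset=" parameter of the Content-Type header first, then scan the first part of the body for a META tag, and only then fall back to ISO-8859-1 (the same fallback the attached patch uses). Nothing here is existing Nutch API; the class name `CharsetSniffer`, its methods and the 2048-byte scan limit are all assumptions made for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharsetSniffer {
    // Matches charset=value, with optional quotes, in a header value or a META tag.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_TAG =
        Pattern.compile("<meta[^>]*>", Pattern.CASE_INSENSITIVE);

    // How many leading bytes to scan for a META declaration (arbitrary choice).
    private static final int SNIFF_LIMIT = 2048;

    // Returns the charset name from a Content-Type header value, or null.
    static String fromContentType(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    // Returns the charset name declared in a META tag near the start of the body, or null.
    static String fromMeta(byte[] content) {
        int len = Math.min(content.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to a char, so the ASCII markup survives intact.
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher meta = META_TAG.matcher(head);
        while (meta.find()) {
            Matcher m = CHARSET.matcher(meta.group());
            if (m.find()) return m.group(1);
        }
        return null;
    }

    // Header first, then META, then the ISO-8859-1 fallback.
    static Charset detect(String contentType, byte[] content) {
        String name = fromContentType(contentType);
        if (name == null) name = fromMeta(content);
        if (name != null) {
            try {
                return Charset.forName(name);
            } catch (IllegalArgumentException e) {
                // unknown or malformed charset name: use the fallback below
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Re-encodes the body as UTF-8 using whatever charset was detected.
    static byte[] toUtf8(String contentType, byte[] content) {
        Charset cs = detect(contentType, content);
        if (cs.equals(StandardCharsets.UTF_8)) return content;
        return new String(content, cs).getBytes(StandardCharsets.UTF_8);
    }
}
```

Something like `content = CharsetSniffer.toUtf8(getHeader("Content-Type"), content);` could then stand in for the conversion block in the patch. Whether the header or the META tag should win when they disagree is a judgment call; the sketch simply trusts the header.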

--
Best regards,
Andrzej Bialecki

-------------------------------------------------
Software Architect, System Integration Specialist
CEN/ISSS EC Workshop, ECIMF project chair
EU FP6 E-Commerce Expert/Evaluator
-------------------------------------------------
FreeBSD developer (http://www.freebsd.org)

? patch.diff
Index: HttpResponse.java
===================================================================
RCS file: /cvsroot/nutch/nutch/src/java/net/nutch/net/protocols/http/HttpResponse.java,v
retrieving revision 1.3
diff -b -d -u -r1.3 HttpResponse.java
--- HttpResponse.java   29 Apr 2004 20:36:12 -0000      1.3
+++ HttpResponse.java   26 May 2004 17:09:44 -0000
@@ -215,7 +215,25 @@
       if (socket != null)
         socket.close();
     }
-
+    // try to convert to UTF-8
+    String contentType = getHeader("Content-Type");
+    // no content type or not text - give up
+    if (contentType == null || !contentType.toLowerCase().startsWith("text/")) return;
+    contentType = contentType.toLowerCase();
+    int idx = contentType.indexOf("charset=");
+    String encoding = null;
+    if (idx == -1) {
+        // unspecified encoding... assume the most common
+        encoding = "ISO-8859-1";
+    } else encoding = contentType.substring(idx + 8);
+    if (encoding.equals("utf-8")) return;
+    try {
+        String newContent = new String(content, encoding);
+        content = newContent.getBytes("UTF-8");
+    } catch (Exception e) {
+        // *shrug* ignore
+        //e.printStackTrace();
+    }
   }
 
   private void readPlainContent(InputStream in)
