You might want to check out this page
http://wiki.apache.org/solr/SolrTomcat

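Out of the box, Tomcat needs a small config change to properly support UTF-8
in request URLs: setting URIEncoding="UTF-8" on the HTTP Connector in
conf/server.xml, as described on that page.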


Thanks,
Charlie


-----Original Message-----
From: Mario Knezovic [mailto:[EMAIL PROTECTED] 
Sent: Friday, August 17, 2007 12:58 PM
To: solr-user@lucene.apache.org
Subject: UTF-8 encoding problem on one of two Solr setups

Hi all,

I have set up identical Solr 1.1 installations on two different machines. One
works fine, the other has a UTF-8 encoding problem.

#1 is my local Windows XP machine. Solr is running basically in a
configuration like in the tutorial example with Jetty/5.1.11RC0 (Windows
XP/5.1 x86 java/1.6.0). Everything works fine here as expected.

#2 is a Linux machine with Solr running inside Tomcat 6. The problem happens
here, and this is where Solr will ultimately run.

To rule out all problems in my PHP and Java code, I tested the problem with
the Solr admin page and it happens there as well. (Tested with Firefox 2
with site's char encoding UTF-8.)

When entering an arbitrary search string containing UTF-8 chars I get a
correct response from the local Windows Solr setup:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
 <int name="status">0</int>
 <int name="QTime">0</int>
 <lst name="params">
  <str name="indent">on</str>
  <str name="start">0</str>
  <str name="q">München</str>  <-- sample string containing a German umlaut-u
  <str name="rows">10</str>
  <str name="version">2.2</str>
 </lst>
</lst>
[...]

When I do exactly the same, just on the admin page of the other Solr setup
(but from exactly the same browser), I get the following response:

[...]
<str name="q">item$searchstring_de:MÃ¼nchen</str>
[...]

Obviously the UTF-8 bytes of the umlaut-u, 0xC3 0xBC, were interpreted as two
8-bit chars instead of one UTF-8 char.
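
For illustration, the effect can be reproduced outside of Solr. This is only a
minimal sketch in Python, assuming the server side decodes the percent-encoded
query string as ISO-8859-1 instead of UTF-8; the variable names are made up for
the example and nothing here is taken from Solr or Tomcat code.

```python
from urllib.parse import quote, unquote

query = "München"

# The browser sends the query percent-encoded as UTF-8:
# the umlaut-u becomes the two bytes 0xC3 0xBC.
utf8_bytes = query.encode("utf-8")
encoded = quote(query)  # percent-encodes using UTF-8 by default

# Assumption: the container decodes the URL as ISO-8859-1,
# so each of the two bytes becomes its own character.
decoded_wrong = unquote(encoded, encoding="iso-8859-1")

# Decoding the same bytes as UTF-8 restores the original string.
decoded_right = unquote(encoded, encoding="utf-8")

print(utf8_bytes)
print(encoded)
print(decoded_wrong)
print(decoded_right)
```

The wrongly decoded string is one character longer than the original, which
matches the symptom of one UTF-8 char showing up as two 8-bit chars.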

Unfortunately I am pretty new to Solr, Tomcat and related topics, so I have
not been able to find the problem yet. My guess is that it is outside of Solr,
maybe in the Tomcat configuration, but I have spent the entire day on it
without getting any further.

But apart from that Solr really rocks. Indexing tons of content and
searching works just fine and fast and it was pretty easy to get into
everything. Now I am changing all data to UTF-8 and ran into my first
serious obstacle... after a few weeks of Solr usage!

Any hint/help appreciated. Thank you very much.

Mario
