At Fri, 20 Aug 2010 20:39:47 +0100, Timo Sirainen wrote:
>
> On Fri, 2010-08-20 at 09:53 -0700, Erik Hetzner wrote:
>
> > For what it’s worth, here are the “invalid XML character”s being
> > complained about by Solr:
>
> Oh. It's not about illegal UTF8 sequences, but about some unicode
> characters actually not being valid for XML. Hopefully these help:
>
> http://hg.dovecot.org/dovecot-1.2/rev/5efba9f9f0a7
> http://hg.dovecot.org/dovecot-1.2/rev/cf0da2cd31fb

Hi Timo,

Unfortunately, this second changeset (cf0da2cd31fb) seems to have introduced a bug that results in every other character being dropped from strings before they are indexed. For instance, my username `egh` becomes `eh`, `spam` becomes `sa`, `drafts` becomes `dat`, etc.

Furthermore, I am not sure that the UTF-8 code is working as expected.

Attached is a patch that fixes the problem with every second character being dropped and results in a Solr index that can be searched for Unicode characters (at least, I tested it with Latin accents and with Greek).

best,
Erik

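For readers without the attachment, here is a rough sketch of the kind of filtering involved. It is not the attached patch and not Dovecot's actual code: the function names (`xml_char_is_valid`, `utf8_decode`, `xml_filter_utf8`) are hypothetical, and the decoder is simplified (it does not reject overlong encodings). It assumes the goal is to copy a UTF-8 string while dropping any character outside the XML 1.0 `Char` production. The relevant detail is that the input index advances exactly once per decoded character; advancing it a second time in the same iteration is one way to get the "every other character dropped" symptom, though whether that is what happened in cf0da2cd31fb is only a guess.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if cp is allowed by the XML 1.0 "Char" production. */
static int xml_char_is_valid(uint32_t cp)
{
    return cp == 0x9 || cp == 0xA || cp == 0xD ||
           (cp >= 0x20 && cp <= 0xD7FF) ||
           (cp >= 0xE000 && cp <= 0xFFFD) ||
           (cp >= 0x10000 && cp <= 0x10FFFF);
}

/* Decodes one UTF-8 sequence starting at s (len bytes available).
 * Stores the code point in *cp and returns the number of bytes used,
 * or 0 if the sequence is malformed or truncated.
 * Simplified: overlong encodings are not rejected. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n, i;
    uint32_t c;

    if (len == 0)
        return 0;
    if (s[0] < 0x80) {
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        n = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        n = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        n = 4; c = s[0] & 0x07;
    } else {
        return 0;
    }
    if (len < n)
        return 0;
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return n;
}

/* Copies in[0..in_len) to out, skipping malformed bytes and characters
 * that are not valid in XML 1.0. Always NUL-terminates when out_size > 0.
 * Returns the number of bytes written, not counting the terminator. */
size_t xml_filter_utf8(const char *in, size_t in_len,
                       char *out, size_t out_size)
{
    const unsigned char *src = (const unsigned char *)in;
    size_t i = 0, o = 0;

    if (out_size == 0)
        return 0;
    while (i < in_len) {
        uint32_t cp;
        size_t n = utf8_decode(src + i, in_len - i, &cp);

        if (n == 0) {
            i++;            /* skip a single malformed byte */
            continue;
        }
        if (xml_char_is_valid(cp)) {
            if (o + n >= out_size)
                break;      /* no room left for this character */
            memcpy(out + o, src + i, n);
            o += n;
        }
        i += n;             /* advance exactly once per character */
    }
    out[o] = '\0';
    return o;
}
```

With this sketch, plain ASCII such as `egh` passes through unchanged, accented Latin and Greek text is copied byte for byte, and only characters such as U+0001 or U+FFFE are removed.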
solr_unicode.diff
Description: Binary data
