At Fri, 20 Aug 2010 20:39:47 +0100,
Timo Sirainen wrote:
> 
> On Fri, 2010-08-20 at 09:53 -0700, Erik Hetzner wrote:
> 
> > For what it’s worth, here are the “invalid XML character”s being
> > complained about by Solr:
> 
> Oh. It's not about illegal UTF8 sequences, but about some unicode
> characters actually not being valid for XML. Hopefully these help:
> 
> http://hg.dovecot.org/dovecot-1.2/rev/5efba9f9f0a7
> http://hg.dovecot.org/dovecot-1.2/rev/cf0da2cd31fb

Hi Timo,

Unfortunately this second changeset (cf0da2cd31fb) seems to have
introduced a bug that results in every other character being dropped
from strings before they are indexed. For instance, my username `egh`
becomes `eh`, `spam` becomes `sa`, `drafts` becomes `dat`,
etc. Furthermore, I am not sure that the UTF-8 code is working as
expected. Attached is a patch which fixes the problem with every
second character being dropped and results in a Solr index that can be
searched for Unicode characters (at least I tested it with Latin
accents and with Greek).
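
To make the symptom and the intended behaviour concrete, here is a
minimal sketch in Python. It is not the attached patch and not
Dovecot's C code. The function names are hypothetical, and the stray
extra index step in `buggy_copy` is only an assumed way to reproduce
the observed output, not a diagnosis of the changeset. The validity
ranges follow the XML 1.0 `Char` production.

```python
def is_valid_xml_char(cp: int) -> bool:
    """Return True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def xml_safe(text: str) -> str:
    """Drop only the characters XML 1.0 forbids; keep everything else."""
    return "".join(ch for ch in text if is_valid_xml_char(ord(ch)))


def buggy_copy(text: str) -> str:
    """Reproduce the symptom: the index advances twice per copied character."""
    out = []
    i = 0
    while i < len(text):
        out.append(text[i])
        i += 1  # intended step
        i += 1  # hypothetical stray extra step: skips the next character
    return "".join(out)
```

With this sketch, `buggy_copy("egh")` gives `eh`, matching what I see in
the index, while `xml_safe` leaves `egh`, accented Latin, and Greek text
untouched and removes only control characters such as NUL.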

best, Erik

Attachment: solr_unicode.diff
Description: Binary data

