Bug#719858: codesearch: Indexer only accepts only valid UTF-8

Michael Stapelberg Fri, 16 Aug 2013 04:30:24 -0700

Hi Hilko,

Hilko Bengen <[email protected]> writes:
> BTW, I just tried passing 'äöü' as a Latin1-encoded string (bytes e4 f6
> fc) to csearch. This led to regexp/syntax failing with an "invalid
> UTF-8" error, so this does not work, even if the character encoding of
> the search term matches that of the index.
Yep, that is what I suspected. Only UTF-8 is supported.
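
For anyone who wants to reproduce this outside of csearch, here is a minimal
sketch that assumes the failure comes straight from Go's regexp/syntax parser.
The helper name invalidUTF8Pattern is made up for illustration and is not part
of codesearch.

```go
package main

import (
	"errors"
	"fmt"
	"regexp/syntax"
)

// invalidUTF8Pattern reports whether regexp/syntax rejects pattern
// specifically because it is not valid UTF-8.
// (Hypothetical helper for illustration, not part of codesearch.)
func invalidUTF8Pattern(pattern string) bool {
	_, err := syntax.Parse(pattern, syntax.Perl)
	var serr *syntax.Error
	return errors.As(err, &serr) && serr.Code == syntax.ErrInvalidUTF8
}

func main() {
	latin1 := "\xe4\xf6\xfc" // 'äöü' as Latin-1 bytes e4 f6 fc
	fmt.Println(invalidUTF8Pattern(latin1)) // true
	fmt.Println(invalidUTF8Pattern("äöü"))  // false: the same text as UTF-8 parses fine
}
```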


> A "proper" solution would probably involve guessing the character set of
> a text file and converting it if necessary before indexing. Meh.
> How are you dealing with this in codesearch.debian.net?
I just assume everything is UTF-8. If a file is not, and actually contains
non-ASCII characters, it needs to be converted to UTF-8. I mean,
come on. This is 2013. Just convert the files already! :-)
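
If you want to do the conversion in a preprocessing step, something along these
lines might work. It is only a sketch: it does not guess the character set, it
simply ASSUMES that anything which is not valid UTF-8 is Latin-1 (ISO-8859-1),
which will be wrong for other encodings. The function name toUTF8 is made up
for illustration and is not something codesearch provides.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// toUTF8 returns data unchanged if it is already valid UTF-8.
// Otherwise it ASSUMES the input is Latin-1 (ISO-8859-1), where every
// byte value equals its Unicode code point, and re-encodes it as UTF-8.
// (Hypothetical helper for illustration, not part of codesearch.)
func toUTF8(data []byte) []byte {
	if utf8.Valid(data) {
		return data
	}
	runes := make([]rune, len(data))
	for i, b := range data {
		runes[i] = rune(b)
	}
	return []byte(string(runes))
}

func main() {
	latin1 := []byte{0xe4, 0xf6, 0xfc} // 'äöü' in Latin-1
	fmt.Printf("%q\n", toUTF8(latin1))
}
```

Pure ASCII files pass through untouched, since ASCII is already valid UTF-8.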

-- 
Best regards,
Michael


