Hello list,
I'm looking for a way to change the character encoding per index. It
feels silly to store Chinese characters in 3 bytes each using UTF-8
when 2 bytes would do using UTF-16. By hacking IndexInput and
IndexOutput I got everything running in UTF-16, quick and dirty, but
that is not good enough, since I have other indexes that are better
suited to UTF-8.
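
To put numbers on the size difference, here is a small standalone
sketch. It is not Lucene code; the class name EncodingSize and its
helper methods are made up for illustration. It uses UTF-16BE so that
no byte order mark is counted.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: compares encoded sizes of the same text in two charsets.
public class EncodingSize {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies as UTF-16 (big endian, no BOM).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two CJK characters
        String latin = "lucene";         // six ASCII characters

        System.out.println("chinese: utf8=" + utf8Length(chinese)
                + " utf16=" + utf16Length(chinese));
        System.out.println("latin:   utf8=" + utf8Length(latin)
                + " utf16=" + utf16Length(latin));
    }
}
```

The CJK text is smaller in UTF-16 and the ASCII text is smaller in
UTF-8, which is why a single fixed encoding does not fit all of my
indexes.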
The character encoding in Lucene today is quite static. To make the
encoding selectable, it seems to me I would have to do some major
refactoring of the project, passing a character codec from my analyzer
(or perhaps IndexWriter/Reader) all the way down to the
IndexInput/Output via TermVector/Info, etc.
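
To make the idea concrete, here is a rough sketch of what I mean by a
character codec. All names here (CodecSketch, CharacterCodec,
CharsetCodec) are hypothetical and are not existing Lucene classes; the
real work would be threading such an object through to the code that
reads and writes strings.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not part of Lucene: a pluggable per-index string codec.
public class CodecSketch {

    // What the low-level input/output code would need from a codec.
    interface CharacterCodec {
        byte[] encode(String text);
        String decode(byte[] bytes, int offset, int length);
    }

    // Simple codec backed by a java.nio Charset chosen per index.
    static final class CharsetCodec implements CharacterCodec {
        private final Charset charset;

        CharsetCodec(Charset charset) {
            this.charset = charset;
        }

        @Override
        public byte[] encode(String text) {
            return text.getBytes(charset);
        }

        @Override
        public String decode(byte[] bytes, int offset, int length) {
            return new String(bytes, offset, length, charset);
        }
    }

    public static void main(String[] args) {
        // One codec per index: UTF-16 for the CJK index, UTF-8 for the others.
        CharacterCodec cjkCodec = new CharsetCodec(StandardCharsets.UTF_16BE);
        CharacterCodec defaultCodec = new CharsetCodec(StandardCharsets.UTF_8);

        String term = "\u4e2d\u6587"; // two CJK characters
        byte[] asUtf16 = cjkCodec.encode(term);
        byte[] asUtf8 = defaultCodec.encode(term);

        System.out.println("utf16 bytes: " + asUtf16.length);
        System.out.println("utf8 bytes:  " + asUtf8.length);
        System.out.println("round trip ok: "
                + term.equals(cjkCodec.decode(asUtf16, 0, asUtf16.length)));
    }
}
```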
Can someone think of a better way to set the character encoding per
index? Any other thoughts are welcome too.
--
karl