On 12 Dec 2005 at 16:40, karl wettin wrote:
Hello list,
I'm looking for a way to change the character encoding per index. It
feels silly to store Chinese characters in 3 bytes each using UTF-8
when 2 bytes each would do using UTF-16. With a quick and dirty hack
of IndexInput and IndexOutput I got everything running in UTF-16, but
that is not good enough, since I have other indexes that are better
off encoded in UTF-8.
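To make the size argument concrete, here is a minimal sketch that compares the encoded length of the same text in both encodings. It uses the JDK's standard charsets, not Lucene's own on-disk string format, and the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Illustration only: compares encoded sizes using the JDK charsets,
// not Lucene's IndexOutput. All names here are hypothetical.
class EncodingSizeDemo {

    // Number of bytes the string occupies in UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies in UTF-16.
    // UTF_16BE is used so that no byte-order mark is prepended.
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String chinese = "\u4e2d\u6587"; // two Chinese characters
        String latin = "lucene";         // six ASCII characters

        System.out.println("chinese: utf8=" + utf8Length(chinese)
                + " utf16=" + utf16Length(chinese));
        System.out.println("latin:   utf8=" + utf8Length(latin)
                + " utf16=" + utf16Length(latin));
    }
}
```

The Chinese sample is smaller in UTF-16 while the ASCII sample is smaller in UTF-8, which is why a single encoding for all indexes is a poor fit.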
The character encoding in Lucene today is quite static. To select an
encoding, it seems to me I would have to do some major refactoring of
the project, passing a character codec from my analyzer (or perhaps
IndexWriter/Reader) all the way down to IndexInput/Output via
TermVector/Info, etc.
Can someone think of a better way to set the character encoding per
index? Or does anyone have some other thought?
My current thought is to extend Directory
(CharacterEncodingAwareDirectory or so) and all implementations of it
to intercept the create/openFile methods and add a character encoding
strategy to the IndexInput/Output.
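A rough sketch of that idea, assuming a wrapper that delegates instead of subclassing every implementation. The SketchDirectory, SketchOutput, SketchInput and InMemoryDirectory types are hypothetical stand-ins written for this example, not Lucene's Directory, IndexOutput or IndexInput, and a Charset stands in for the encoding strategy.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a directory of named files (not Lucene's Directory).
interface SketchDirectory {
    SketchOutput createFile(String name);
    SketchInput openFile(String name);
}

// Hypothetical output: stores one string per file, with a pluggable charset.
class SketchOutput {
    private final Map<String, byte[]> files;
    private final String name;
    Charset charset = StandardCharsets.UTF_8; // default when nothing intercepts

    SketchOutput(Map<String, byte[]> files, String name) {
        this.files = files;
        this.name = name;
    }

    void writeString(String s) {
        files.put(name, s.getBytes(charset));
    }
}

// Hypothetical input: decodes the stored bytes with a pluggable charset.
class SketchInput {
    private final byte[] data;
    Charset charset = StandardCharsets.UTF_8; // default when nothing intercepts

    SketchInput(byte[] data) {
        this.data = data;
    }

    String readString() {
        return new String(data, charset);
    }

    int length() {
        return data.length;
    }
}

// Trivial in-memory implementation, used here only to exercise the wrapper.
class InMemoryDirectory implements SketchDirectory {
    private final Map<String, byte[]> files = new HashMap<>();

    public SketchOutput createFile(String name) {
        return new SketchOutput(files, name);
    }

    public SketchInput openFile(String name) {
        return new SketchInput(files.get(name));
    }
}

// Intercepts create/open calls and installs the per-index charset
// on whatever input or output the wrapped directory returns.
class CharacterEncodingAwareDirectory implements SketchDirectory {
    private final SketchDirectory delegate;
    private final Charset charset;

    CharacterEncodingAwareDirectory(SketchDirectory delegate, Charset charset) {
        this.delegate = delegate;
        this.charset = charset;
    }

    public SketchOutput createFile(String name) {
        SketchOutput out = delegate.createFile(name);
        out.charset = charset;
        return out;
    }

    public SketchInput openFile(String name) {
        SketchInput in = delegate.openFile(name);
        in.charset = charset;
        return in;
    }
}
```

With this shape, one index could be opened through `new CharacterEncodingAwareDirectory(dir, StandardCharsets.UTF_16BE)` and another through the plain directory, and nothing between the analyzer and the input/output needs to know which codec is in use. Whether the real classes allow this kind of interception is exactly the open question.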
Is there a reason for the write/readCharacters in IndexInput/Output
to be final?
--
karl