On Sat, May 06, 2006 at 05:11:02PM +0900, David Balmain wrote:

> Hi Marvin,
>
> Where are you with this? I also have a vested interest in seeing
> Lucene move to using byte counts. I was wondering if I could help out.
> Is the patch you pasted here the latest you have?
All I've added since then is debugging code, including some last night.

As I mentioned in another thread, this is going to be a multi-stage
process. The goal of that first patch is to have Lucene using byte
counts everywhere (except for TermVectors, simply because it isn't
strictly necessary there). Lucene will be slower after the patch is
[fixed, completed and] applied. The next stage will involve finding
optimizations to return Lucene to at least its prior speed; the
primary target is the segment merger.

Looking ahead, it will be interesting to see how many of the
advantages of working with term text as bytestrings can be realized.
Lazy loading of fields should be an obvious winner, since a reader can
skip past a stored field by its byte length without decoding it. The
cached .tii in TermInfosReader could potentially occupy a lot less RAM
if your text takes up less space in UTF-8 than it does as Java chars.
And it becomes theoretically possible to have Lucene use an arbitrary
encoding for character data in the index, rather than only UTF-8. (A
couple of toy sketches of the first two points follow below.)

The intended mechanics of that patch should be plain enough. I'm going
to take another crack at seeing what's wrong with it today. If
somebody beats me to a solution, I won't complain. :)
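Here's a toy sketch of the space question (not from the patch; the
class name and sample terms are mine, purely for illustration). A Java
char always occupies 2 bytes in RAM, while UTF-8 costs 1 byte per
ASCII character and up to 3 bytes for most others, so the win depends
entirely on the text:

    import java.io.UnsupportedEncodingException;

    // Toy demonstration: how Java char counts and UTF-8 byte counts
    // diverge, and why caching term text as UTF-8 bytes can shrink
    // the .tii footprint for mostly-ASCII indexes.
    public class ByteCountDemo {
        public static void main(String[] args)
                throws UnsupportedEncodingException {
            // ASCII, Latin-1 accented, and CJK sample terms.
            String[] terms = { "search", "r\u00e9sum\u00e9", "\u691c\u7d22" };
            for (int i = 0; i < terms.length; i++) {
                String term = terms[i];
                int chars = term.length();                     // Java chars
                int charBytes = chars * 2;                     // 2 bytes each in RAM
                int utf8Bytes = term.getBytes("UTF-8").length; // 1-3 bytes each
                System.out.println(term + ": " + chars + " chars ("
                    + charBytes + " bytes as char[]), "
                    + utf8Bytes + " bytes as UTF-8");
            }
        }
    }

For "search" that's 12 bytes as chars versus 6 as UTF-8; for the CJK
term it goes the other way, 4 versus 6.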
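And a sketch of the lazy-loading win, under the assumption that each
stored field value is prefixed with its byte length. (I'm using a
plain DataInputStream and an int prefix here just to keep the sketch
self-contained; Lucene's real stored fields go through IndexInput and
VInts.)

    import java.io.DataInputStream;
    import java.io.IOException;

    public class LazyFieldSketch {
        // With a byte-length prefix, skipping an unwanted field never
        // touches the UTF-8 decoder -- just advance the stream.
        static void skipField(DataInputStream in) throws IOException {
            int byteLen = in.readInt(); // length prefix written at index time
            int remaining = byteLen;
            while (remaining > 0) {
                int skipped = in.skipBytes(remaining);
                if (skipped <= 0) {
                    throw new IOException("unexpected end of stream");
                }
                remaining -= skipped;
            }
        }

        // Decode only when the field is actually requested.
        static String readField(DataInputStream in) throws IOException {
            byte[] buf = new byte[in.readInt()];
            in.readFully(buf);
            return new String(buf, "UTF-8");
        }
    }

With a char-count prefix, by contrast, finding the end of a field
means decoding the variable-width data character by character, so
there's no constant-time skip.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/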