On Sep 21, 2005, at 12:25 PM, Yonik Seeley wrote:
> How does this patch work w.r.t. the length vint? It looks like the length is still the number of 16-bit Java chars, but the encoding is now correct UTF-8?
Yes. As Ken Krugler pointed out to me, the two issues (the encoding itself and the unit the length VInt counts in) can be separated. The length VInt can be changed now or in the future.
There may be lots of reasons to change the length VInt to count bytes; IIRC, you were one of the people inclined in that direction. (Another possibility was to count Unicode characters, but there doesn't seem to be any advantage in going that route besides aesthetic harmony.) The decision to change it or not will have to be made after a festive round of benchmarking.
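To make the three candidate units for the length VInt concrete, here is a minimal Java sketch that counts the same string in 16-bit Java chars, in UTF-8 bytes, and in Unicode code points. The class and method names (`LengthUnits`, `javaChars`, `utf8Bytes`, `codePoints`) are hypothetical and are not part of Lucene or of the patch; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Lucene code: compares candidate length units.
public class LengthUnits {
    // 16-bit Java chars (UTF-16 code units): what the length VInt counts today.
    static int javaChars(String s) {
        return s.length();
    }

    // Bytes in the standard UTF-8 encoding of the string.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Unicode code points (a surrogate pair counts once).
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // ASCII, Latin-1 accent, euro sign, and a character outside the BMP (U+1D11E).
        String[] samples = { "abc", "caf\u00e9", "\u20ac", "\uD834\uDD1E" };
        for (String s : samples) {
            System.out.println(javaChars(s) + " chars, "
                + utf8Bytes(s) + " bytes, "
                + codePoints(s) + " code points");
        }
    }
}
```

The three counts agree only for pure ASCII. A reader that knows the byte count can grab the whole string in one read, whereas a count in chars forces it to decode as it goes, which is presumably why bytes are the attractive unit for a non-Java implementation.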
If nobody steps up to do that benchmarking, I'll probably try to kickstart the discussion with a little of my own, as it would be much better for the Perl side to use bytes as the length VInt, no question. But since I'm basically an army of one working the Perl angle right now, it would be great if I didn't have to stretch myself even thinner doing benchmarking in Java when there are a lot more people with a lot more expertise who can take that on.
Perl development is going very well, by the way. On the indexing side, I've got a new app going which solves both the index compatibility issue and the speed issue, about which I'll make a presentation in this forum after I flesh it out and clean it up.
Well, I'm lying a little. The app doesn't quite write a valid Lucene 1.4.3 index, since it writes true UTF-8. If these patches get adopted prior to the release of 1.9, though, it will write valid Lucene 1.9 indexes.
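For anyone wondering where "true UTF-8" and the older encoding part ways, the following Java sketch compares standard UTF-8 with Java's modified UTF-8 as produced by `DataOutputStream.writeUTF`. Using `writeUTF` as a stand-in for the pre-patch Lucene string encoding is an assumption made for illustration; the class and method names are hypothetical and this is not Lucene's own writer.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical comparison helper, not Lucene code.
public class Utf8Variants {
    // Standard ("true") UTF-8 bytes.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Java's modified UTF-8, taken from DataOutputStream.writeUTF
    // with its 2-byte length prefix stripped off.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeUTF(s);
            out.flush();
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Plain ASCII, the NUL character, and a character outside the BMP (U+1D11E).
        String[] samples = { "abc", "\u0000", "\uD834\uDD1E" };
        for (String s : samples) {
            System.out.println(standardUtf8(s).length + " bytes standard, "
                + modifiedUtf8(s).length + " bytes modified");
        }
    }
}
```

The two encodings agree for most text and differ for U+0000 and for characters outside the Basic Multilingual Plane, where modified UTF-8 encodes each surrogate separately. An index containing such characters is therefore not byte-compatible between the two schemes.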
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/