Re: Lucene and UTF-8

Marvin Humphrey Wed, 21 Sep 2005 13:45:57 -0700


On Sep 21, 2005, at 12:25 PM, Yonik Seeley wrote:

How does this patch work w.r.t. the length vint?

It looks like the length is still the number of 16 bit java chars,
but the encoding is now correct UTF-8?

Yes. As Ken Krugler pointed out to me, the issues can be separated. The length VInt can be changed now or in the future.

There may be lots of reasons to change the length VInt to use bytes; IIRC, you were one of the people inclined in that direction. (Another possibility was to use UTF-8 characters, but there doesn't seem to be any advantage in going that route besides aesthetic harmony.) The decision to change it or not to change it will have to be taken after a festive round of benchmarking.
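To make the three candidate units concrete, here is a minimal sketch that counts the same string in 16-bit Java chars, in Unicode code points (my reading of "UTF-8 characters" above), and in UTF-8 bytes. The class and method names are made up for illustration and are not part of Lucene or of the patch.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not Lucene code: three ways to measure a string's "length".
class LengthUnits {
    // Number of 16-bit Java chars (UTF-16 code units).
    static int javaChars(String s) {
        return s.length();
    }

    // Number of Unicode code points; a surrogate pair counts once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in the standard UTF-8 encoding.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+00E9 (2 bytes), U+20AC (3 bytes), U+1D11E (4 bytes, surrogate pair)
        String s = "a\u00E9\u20AC\uD834\uDD1E";
        System.out.println("java chars:  " + javaChars(s));   // 5
        System.out.println("code points: " + codePoints(s));  // 4
        System.out.println("utf-8 bytes: " + utf8Bytes(s));   // 10
    }
}
```

A reader that knows the byte count can skip or copy a string without decoding it, which is presumably why a byte-based length is attractive from the Perl side; the char and code point counts both require walking the encoded bytes.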

If nobody steps up to do that benchmarking, I'll probably try to kickstart the discussion with a little of my own, as it would be much better for the Perl side to use bytes as the length VInt, no question. But since I'm basically an army of one working the Perl angle right now, it would be great if I didn't have to stretch myself even thinner doing benchmarking in Java when there are a lot more people with a lot more expertise who can take that on.

Perl development is going very well, by the way. On the indexing side, I've got a new app going which solves both the index compatibility issue and the speed issue, about which I'll make a presentation in this forum after I flesh it out and clean it up.

Well, I'm lying a little. The app doesn't quite write a valid Lucene 1.4.3 index, since it writes true UTF-8. If these patches get adopted prior to the release of 1.9, though, it will write valid Lucene 1.9 indexes.
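For anyone wondering where "true UTF-8" and the older encoding part ways, here is a rough sketch. It uses Java's own modified UTF-8 (what DataOutputStream.writeUTF produces) as a stand-in. That the bytes Lucene 1.4.3 writes match it in these cases is an assumption on my part, and the class name is made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical comparison helper, not Lucene code.
class Utf8Variants {
    // Standard ("true") UTF-8.
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Java's modified UTF-8, as written by DataOutputStream.writeUTF,
    // with the two-byte length prefix that writeUTF emits stripped off.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        String nul = "\u0000";          // U+0000
        String clef = "\uD834\uDD1E";   // U+1D11E, outside the Basic Multilingual Plane
        // U+0000: 1 byte in standard UTF-8, 2 bytes (C0 80) in modified UTF-8.
        System.out.println(standardUtf8(nul).length + " vs " + modifiedUtf8(nul).length);
        // U+1D11E: 4 bytes in standard UTF-8, 6 bytes (two 3-byte surrogates) in modified UTF-8.
        System.out.println(standardUtf8(clef).length + " vs " + modifiedUtf8(clef).length);
    }
}
```

For text made up only of characters between U+0001 and U+FFFF, with no surrogates, the two encodings produce identical bytes, so the incompatibility only shows up with null characters and supplementary characters.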


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


