Re: storing term text internally as byte array and bytecount as prefix, etc.

Marvin Humphrey Mon, 01 May 2006 19:09:17 -0700

On May 1, 2006, at 6:27 PM, jian chen wrote:

This way, for indexing new documents, the new Term(String text) is called and utf8bytes will be obtained from the input term text. For segment term
info merge, the utf8bytes will be loaded from the Lucene index, which
already stores the term text as utf8 bytes. Therefore, no conversion is
needed.

SegmentMerger will have to change to use bytes if a bytecount-based string header is going to achieve acceptable performance. Doug pointed that out when I was about to throw in the towel because I couldn't get things fast enough. Changing the implementation of Term would have a very broad impact; I'd look for other ways to go about it first. But I'm not an expert on SegmentMerger, as KinoSearch doesn't use the same technique for merging.

My plan was to first submit a patch that made the change to the file format but didn't touch SegmentMerger, then attack SegmentMerger and also see if other developers could suggest optimizations.
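
To make the idea concrete, here is a minimal sketch of what a bytecount-prefixed term might look like on disk, and how a merge step could compare terms as raw bytes without decoding them to Strings. This is not the actual patch and not Lucene's API: the class ByteCountTerms and all of its methods are hypothetical names made up for illustration, and the variable-length integer encoding (7 bits per byte, low-order bits first) is an assumption about the header format.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Lucene or KinoSearch.
final class ByteCountTerms {

    // Encode term text once, at indexing time.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Assumed header encoding: 7 bits per byte, low-order bits first,
    // high bit set on every byte except the last.
    static void writeVInt(OutputStream out, int value) throws IOException {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    static int readVInt(InputStream in) throws IOException {
        int b = in.read();
        if (b < 0) throw new EOFException();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            if (b < 0) throw new EOFException();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // The prefix is the number of bytes, not the number of chars.
    static void writeTerm(OutputStream out, byte[] utf8) throws IOException {
        writeVInt(out, utf8.length);
        out.write(utf8);
    }

    // Reads the term back as bytes; no UTF-8 decoding takes place.
    static byte[] readTerm(InputStream in) throws IOException {
        int length = readVInt(in);
        byte[] buf = new byte[length];
        int off = 0;
        while (off < length) {
            int n = in.read(buf, off, length - off);
            if (n < 0) throw new EOFException();
            off += n;
        }
        return buf;
    }

    // Unsigned lexicographic comparison of two byte arrays.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) return diff;
        }
        return a.length - b.length;
    }
}
```

One caveat with comparing bytes: unsigned UTF-8 byte order is Unicode code point order, which differs from String.compareTo (UTF-16 code unit order) when supplementary characters are involved, so a byte-based merge would need the index to be sorted the same way.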

However, I have an awful lot on my plate right now, and I basically get paid to do KinoSearch-related work, but not Lucene-related work. It's hard for me to break out the time to do the Java coding, especially since I don't have that much experience with Java and I'm slow. I'm not sure how soon I'll be able to get back to those bytecount patches.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

