On May 1, 2006, at 6:27 PM, jian chen wrote:

This way, for indexing new documents, the new Term(String text) is called and utf8bytes will be obtained from the input term text. For segment term info merge, the utf8bytes will be loaded from the Lucene index, which already stores the term text as utf8 bytes. Therefore, no conversion is needed.

SegmentMerger will have to change to use bytes if a bytecount-based string header is going to achieve acceptable performance. Doug pointed that out when I was about to throw in the towel because I couldn't get things fast enough. Changing the implementation of Term would have a very broad impact; I'd look for other ways to go about it first. But I'm not an expert on SegmentMerger, as KinoSearch doesn't use the same technique for merging.
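
To make the "use bytes" idea concrete, here is a rough Java sketch of comparing two terms directly as UTF-8 bytes, which is the kind of operation a merge would need if it never decoded term text back into a String. This is not Lucene's actual SegmentMerger code; the class ByteTermSketch and the method compareUtf8 are hypothetical names made up for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of Lucene.
class ByteTermSketch {

    // Compare two terms as raw UTF-8 byte sequences, with no String decoding.
    // Bytes are masked to 0..255 because Java's byte type is signed and a
    // plain subtraction would sort multi-byte characters before ASCII.
    static int compareUtf8(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // All shared bytes are equal: the shorter term sorts first.
        return a.length - b.length;
    }

    public static void main(String[] args) {
        // At indexing time the bytes would come from the input text...
        byte[] fromText = "apple".getBytes(StandardCharsets.UTF_8);
        // ...while at merge time they would be read from the index as-is.
        // Here the "stored" bytes are simulated with another encoded string.
        byte[] fromIndex = "apply".getBytes(StandardCharsets.UTF_8);

        System.out.println(compareUtf8(fromText, fromIndex) < 0);
    }
}
```

One caveat with this approach: unsigned UTF-8 byte order matches Unicode code point order, while String.compareTo orders by UTF-16 code unit, so the two can disagree when supplementary characters are involved. Any byte-based comparison would have to use the same order the index was written in.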

My plan was to first submit a patch that made the change to the file format but didn't touch SegmentMerger, then attack SegmentMerger and also see if other developers could suggest optimizations.

However, I have an awful lot on my plate right now, and I basically get paid to do KinoSearch-related work, but not Lucene-related work. It's hard for me to break out the time to do the Java coding, especially since I don't have that much experience with Java and I'm slow. I'm not sure how soon I'll be able to get back to those bytecount patches.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

