On May 1, 2006, at 6:27 PM, jian chen wrote:
> This way, for indexing new documents, the new Term(String text) is called
> and utf8bytes will be obtained from the input term text. For segment term
> info merge, the utf8bytes will be loaded from the Lucene index, which
> already stores the term text as utf8 bytes. Therefore, no conversion is
> needed.

SegmentMerger will have to change to use bytes if a bytecount-based
string header is going to achieve acceptable performance. Doug
pointed that out when I was about to throw in the towel because I
couldn't get things fast enough. Changing the implementation of Term
would have a very broad impact; I'd look for other ways to go about
it first. But I'm not an expert on SegmentMerger, as KinoSearch
doesn't use the same technique for merging.
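
To illustrate the idea of merging on bytes, here is a minimal Java sketch. It is
not Lucene's or KinoSearch's actual code: the class ByteCountSketch and its
methods are hypothetical, and a fixed 4-byte int is assumed for the length
header purely to keep the example short. It writes a term as a byte count
followed by UTF-8 bytes, reads it back without decoding, and compares two terms
directly on their bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; none of these names come from Lucene.
public class ByteCountSketch {

    // Write a term as [byte count][UTF-8 bytes].
    // Assumption: a fixed 4-byte int header, chosen for simplicity.
    public static void writeTerm(DataOutputStream out, String text) throws IOException {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length);
        out.write(utf8);
    }

    // Read a term back as raw bytes. The header says exactly how many bytes
    // to read, so no UTF-8 decoding is needed to find the end of the term.
    public static byte[] readTermBytes(DataInputStream in) throws IOException {
        int length = in.readInt();
        byte[] utf8 = new byte[length];
        in.readFully(utf8);
        return utf8;
    }

    // Compare two terms as unsigned bytes, without building Strings.
    public static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        writeTerm(out, "apple");
        writeTerm(out, "apply");

        DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(buffer.toByteArray()));
        byte[] first = readTermBytes(in);
        byte[] second = readTermBytes(in);

        // A merge loop would use a comparison like this to order terms.
        System.out.println("first sorts before second: "
                + (compareBytes(first, second) < 0));
    }
}
```

One design point to keep in mind: unsigned byte order over UTF-8 is Unicode code
point order, which can differ from String.compareTo (UTF-16 code unit order) for
characters outside the Basic Multilingual Plane, so a byte-based merge and a
String-based merge may not agree on the order of such terms.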

My plan was to first submit a patch that made the change to the file
format but didn't touch SegmentMerger, then attack SegmentMerger and
also see if other developers could suggest optimizations.

However, I have an awful lot on my plate right now, and I basically
get paid to do KinoSearch-related work, but not Lucene-related work.
It's hard for me to break out the time to do the Java coding,
especially since I don't have that much experience with Java and I'm
slow. I'm not sure how soon I'll be able to get back to those
bytecount patches.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/