Re: Optimizing SegmentTermEnum (and friends)

Dmitry Serebrennikov Tue, 25 Feb 2003 15:26:50 -0800

Thanks for your reply, Doug. See below.

Doug Cutting wrote:

Dmitry Serebrennikov wrote:

1) Since I do not need the intermediate terms, it makes sense to have a method that skips to the right term without creating the intermediate Term objects. I did a version of this yesterday and ended up seeing a factor of 2 performance increase and a factor of 2 garbage reduction. The patch adds the following method to Term.java:

    final int compareTo(String otherField, char[] otherText, int start, int len)

It also changes SegmentTermEnum.java to delay creation of the Term object until the call to term(). Full diff is attached. Any comments are welcome, especially if I've missed something.
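The attached diff is not reproduced in this message, so the following is only a minimal sketch of the idea under stated assumptions, not the actual patch. SketchTerm and LazyTermEnumSketch are hypothetical stand-ins for Term and SegmentTermEnum, the term stream is an in-memory sorted array rather than an index file, and skipTo is an illustrative name. Only the compareTo signature is taken from the description above.

```java
// Hypothetical stand-in for Term; only the pieces needed for the sketch.
final class SketchTerm {
    final String field;
    final String text;

    SketchTerm(String field, String text) {
        this.field = field;
        this.text = text;
    }

    // Compares this term with a field name plus a slice of a char buffer,
    // using the same ordering as String.compareTo, without allocating.
    int compareTo(String otherField, char[] otherText, int start, int len) {
        int cmp = field.compareTo(otherField);
        if (cmp != 0) return cmp;
        int n = Math.min(text.length(), len);
        for (int i = 0; i < n; i++) {
            int diff = text.charAt(i) - otherText[start + i];
            if (diff != 0) return diff;
        }
        return text.length() - len;
    }
}

// Hypothetical enum that keeps the current term text in a reusable buffer
// and only builds a SketchTerm when term() is called.
final class LazyTermEnumSketch {
    private final String field;
    private final String[] sortedTexts;   // assumption: stands in for the term file
    private char[] buffer = new char[16];
    private int length;
    private int pos = -1;
    private SketchTerm cached;
    int termsCreated;                      // counts allocations, for demonstration

    LazyTermEnumSketch(String field, String[] sortedTexts) {
        this.field = field;
        this.sortedTexts = sortedTexts;
    }

    // Advances to the next term, copying its text into the shared buffer.
    boolean next() {
        if (pos + 1 >= sortedTexts.length) return false;
        pos++;
        String s = sortedTexts[pos];
        if (s.length() > buffer.length) buffer = new char[s.length() * 2];
        s.getChars(0, s.length(), buffer, 0);
        length = s.length();
        cached = null;
        return true;
    }

    // Creates the term object lazily and caches it until the next advance.
    SketchTerm term() {
        if (cached == null && pos >= 0) {
            cached = new SketchTerm(field, new String(buffer, 0, length));
            termsCreated++;
        }
        return cached;
    }

    // Advances until the current term is >= target; no term objects are created.
    boolean skipTo(SketchTerm target) {
        while (next()) {
            if (target.compareTo(field, buffer, 0, length) <= 0) return true;
        }
        return false;
    }
}
```

In this sketch, skipping over any number of intermediate terms allocates nothing; a single object is created only when the caller finally asks for term().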

Looks reasonable to me. Does it still pass all of the unit tests?

Have not had a chance to run them. I will report results once I do.

    /** Returns the TermInfo for a Term in the set, or null. */
    final synchronized TermInfo get(Term term) throws IOException {
      if (size == 0) return null;

      // optimize sequential access: first try scanning cached enum w/o seeking
      if (enum.term() != null                       // term is at or past current
          && ((enum.prev != null && term.compareTo(enum.prev) > 0)
              || term.compareTo(enum.term()) >= 0)) {
        int enumOffset = (enum.position/TermInfosWriter.INDEX_INTERVAL)+1;
        if (indexTerms.length == enumOffset         // but before end of block
            || term.compareTo(indexTerms[enumOffset]) < 0)
          return scanEnum(term);                    // no need to seek
      }

      // random-access: must seek
      seekEnum(getIndexOffset(term));
      return scanEnum(term);
    }

If you put a print statement in this and run the unit tests you'll see that this optimization fires a lot. If, e.g., one expands a wildcarded string into a bunch of terms, which are near one another in the enum, then subsequently asks for the frequency of each term (to weight it in a query), and then, in a third pass, asks for its TermDocs, then each of these latter passes benefits from this optimization. So let's not lose it.
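To make the access pattern above concrete, here is a toy model of the same lookup logic with a seek counter. It is a sketch under assumptions, not Lucene code: TermLookupSketch is a hypothetical class, terms are plain strings held in a sorted in-memory array, INDEX_INTERVAL is set to 4 purely for the demonstration, and the index lookup is a linear search for brevity.

```java
// Hypothetical toy model of get(): scan from the cached position when the
// target is nearby, otherwise "seek" via the sparse index.
final class TermLookupSketch {
    static final int INDEX_INTERVAL = 4;   // assumption: small value for the demo
    private final String[] terms;          // sorted; stands in for the term file
    private final String[] indexTerms;     // every INDEX_INTERVAL-th term
    private int position = -1;             // cached enum position; -1 = unpositioned
    int seeks;                              // how many times the slow path was taken

    TermLookupSketch(String[] sortedTerms) {
        terms = sortedTerms;
        int n = (sortedTerms.length + INDEX_INTERVAL - 1) / INDEX_INTERVAL;
        indexTerms = new String[n];
        for (int i = 0; i < n; i++) indexTerms[i] = sortedTerms[i * INDEX_INTERVAL];
    }

    boolean contains(String target) {
        String cur = (position >= 0 && position < terms.length) ? terms[position] : null;
        String prev = (position > 0) ? terms[position - 1] : null;
        // sequential access: target is at or past the cached position...
        if (cur != null
                && ((prev != null && target.compareTo(prev) > 0)
                    || target.compareTo(cur) >= 0)) {
            int enumOffset = position / INDEX_INTERVAL + 1;
            // ...and before the start of the next index block
            if (indexTerms.length == enumOffset
                    || target.compareTo(indexTerms[enumOffset]) < 0) {
                return scan(target);
            }
        }
        // random access: reposition at the start of the right block
        seeks++;
        position = indexOffset(target) * INDEX_INTERVAL;
        return scan(target);
    }

    // Largest block whose first term is <= target (block 0 if none).
    private int indexOffset(String target) {
        int i = 0;
        while (i + 1 < indexTerms.length && indexTerms[i + 1].compareTo(target) <= 0) i++;
        return i;
    }

    // Advances until the current term is >= target, then checks for a match.
    private boolean scan(String target) {
        while (position < terms.length && terms[position].compareTo(target) < 0) position++;
        return position < terms.length && terms[position].equals(target);
    }
}
```

With twelve single-letter terms "a" through "l" and three passes over the neighbouring terms "e", "f", "g", this model seeks once per pass (three seeks for nine lookups); the remaining lookups are served by scanning forward from the cached position.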

I know that the optimization as a whole is important, but I was curious to know how important the use of the .prev variable is here. In order to maintain this variable, SegmentsTermEnum is forced to create Term objects that could otherwise be avoided. If I read this code correctly, the optimization kicks in when

    enum has a current term
    && ((enum remembers previous term && that term is less than the target term)
        || the current term is less than or equal to the target term)

The only time the value of the .prev variable is significant is when the enum has a current term but that term is greater than the target. If at the same time the enum also remembers the previous term and that term is less than the target, the optimization is enabled.

Oh, I see, this is important when the target term is not in the enum... There's got to be a better way to implement this that does not require copying the buffer in the SegmentsTermEnum.
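A small worked example of that case, as a sketch only: PrevCheckSketch is a hypothetical helper that models terms as plain strings and reproduces just the first half of the test in get() (the block-boundary check against indexTerms is left out).

```java
// Hypothetical helper isolating the "at or past current" part of get().
final class PrevCheckSketch {
    // The condition as written in get(), with terms modelled as strings.
    static boolean canScan(String prev, String current, String target) {
        return current != null
            && ((prev != null && target.compareTo(prev) > 0)
                || target.compareTo(current) >= 0);
    }

    // The same condition if the enum did not remember the previous term.
    static boolean canScanWithoutPrev(String current, String target) {
        return current != null && target.compareTo(current) >= 0;
    }
}
```

Suppose the enum sits on "cherry", the previous term was "banana", and the target is "cat", which is not in the enum. With prev, canScan("banana", "cherry", "cat") is true, so the miss can be reported without seeking. Without prev, canScanWithoutPrev("cherry", "cat") is false and a seek is needed just to discover that the term is absent. For a target that is present at or after the current term, both variants agree.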

Dmitry.

Doug

