Mike, I had another look at SegmentTermDocs.skipTo() and at SegmentTermPositions, and I think I'm beginning to get your point.
Could it be doable per skipInterval docs?

Regards,
Paul Elschot

On Monday 22 September 2008 19:24:38, Michael McCandless wrote:
> OK, on closer inspection, I don't think this optimization will work,
> unless I'm missing something... But it was a good idea, so keep 'em
> coming!
>
> The TermInfo only stores proxPointer for each term, not per document
> in the postings. This means the optimization could only apply if
> there are no deleted docs in the posting, and the in & out formats
> are congruent. Then we would move writing to proxOutput out of the
> while loop in appendPostings to do a bulk copy of all bytes in the
> proxStream for that one term & segment.
>
> But, there's a problem with that: we can't compute the skip pointer
> as we write. The DefaultSkipListWriter looks at the proxOutput
> pointer every skipInterval docs written and records the offset. If
> we bulk-copy the prox bytes at the end we have no idea what the
> offset is every skipInterval docs.
>
> Mike
>
> Paul Elschot wrote:
> > On Friday 19 September 2008 17:05:29, Michael McCandless wrote:
> >> Not quite, because how positions are encoded depends on whether
> >> any payload appeared in that segment.
> >>
> >> However, if 1) the input is a SegmentReader (since in general we
> >> can merge any IndexReader), and 2) its format is "congruent" with
> >> the format we are writing (ie both don't or do use the payloads
> >> format), which ought to be true the vast majority of the time,
> >> then I think we could simply copy bytes. Since the next TermInfo
> >> tells us the proxPointer where it begins, we know exactly how many
> >> bytes to copy. I think this'd be a nice optimization!
> >
> > I tried to find a way to do this, but I'm stuck at the point where
> > the proxPointer is needed from a TermInfo.
> > I got this far (uncompiled code, smi is the SegmentMergeInfo
> > that is currently merged):
> >
> > if (smi.reader instanceof SegmentReader) {
> >   // smi.reader is declared as an IndexReader, so a cast is needed:
> >   SegmentReader inputReader = (SegmentReader) smi.reader;
> >   boolean readerStorePayloads =
> >     inputReader.fieldInfos.fieldInfo(smi.term.field).storePayloads;
> >   if (storePayloads == readerStorePayloads) {
> >     // take the difference of the two prox pointers:
> >     int positionsLength = inputReader.tis. ... - ...;
> >     // do a direct byte copy from inputReader to proxOutput:
> >     ... ;
> >   }
> > }
> >
> > but I could not find out how to get from the TermInfosReader
> > at inputReader.tis to the next prox pointer.
> >
> > SegmentMerger never needs to index the positions by using a
> > proxPointer itself, as it accesses all positions serially. This
> > leaves me without an example on how to use proxPointer from a
> > TermInfo.
> >
> > Any tips on how to continue?
> >
> > Regards,
> > Paul Elschot
> >
> >> Mike
> >>
> >> Paul Elschot wrote:
> >>> I'm looking at the for loop in SegmentMerger.java at line 666,
> >>> which completely interprets the input positions/payloads for
> >>> an input term at a document.
> >>>
> >>> The positions/payloads don't change when they are merged, is
> >>> that correct? I'm wondering whether this loop could be replaced
> >>> by a direct copy from the input postings to proxOutput.
> >>>
> >>> Regards,
> >>> Paul Elschot
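
A toy sketch of the skip-offset point discussed in this thread. It is not
Lucene code: the class and method names are hypothetical, the byte counts
are made up, and the bookkeeping is simplified to "record the output
pointer after every skipInterval docs". It only illustrates that the
offsets are available when writing doc by doc, and that a block-wise copy
could reproduce them only under the assumption that the byte length of
each run of skipInterval docs is known in advance.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model only; not Lucene code. All names here are hypothetical. */
class ProxCopySketch {

    /**
     * Doc-by-doc writing: the output pointer is known after each document,
     * so an offset can be recorded after every skipInterval docs.
     */
    static List<Long> skipOffsetsPerDoc(int[] proxBytesPerDoc, int skipInterval) {
        List<Long> offsets = new ArrayList<>();
        long pointer = 0;
        for (int i = 0; i < proxBytesPerDoc.length; i++) {
            pointer += proxBytesPerDoc[i]; // "write" this doc's positions
            if ((i + 1) % skipInterval == 0) {
                offsets.add(pointer);      // record the offset for a skip entry
            }
        }
        return offsets;
    }

    /**
     * Block-wise copying: ASSUMES the byte length of each full run of
     * skipInterval docs is already known. Any trailing partial run would be
     * copied afterwards without a skip entry and is not modelled here.
     */
    static List<Long> skipOffsetsPerBlock(long[] proxBytesPerFullBlock) {
        List<Long> offsets = new ArrayList<>();
        long pointer = 0;
        for (long blockBytes : proxBytesPerFullBlock) {
            pointer += blockBytes;         // bulk copy of one whole block
            offsets.add(pointer);
        }
        return offsets;
    }

    public static void main(String[] args) {
        int[] perDoc = {3, 5, 2, 4, 1};    // made-up prox byte counts per doc
        System.out.println(skipOffsetsPerDoc(perDoc, 2));               // [8, 14]
        System.out.println(skipOffsetsPerBlock(new long[] {8L, 6L}));   // [8, 14]
    }
}
```

If only the total length for the term is known, neither list can be
computed from a single bulk copy, which is the problem Mike describes.
Whether per-block lengths can actually be obtained from the input segment
is left open here.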