I guess this also ties in with 'getPositionIncrementGap', which is relevant to fields with multiple occurrences.
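
As a rough illustration of why the gap matters when each occurrence is analyzed separately, here is a minimal plain-Java sketch. It is a toy model and does not use Lucene: the class `PositionGapSketch`, the `invert` method, and the whitespace tokenization are all assumptions made for the example. Only the idea of a per-field position increment gap comes from the discussion.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: no Lucene classes are used, and all names here are hypothetical.
final class PositionGapSketch {

    /** One token plus the field position and the occurrence index it came from. */
    static final class PositionedToken {
        final String term;
        final int position;
        final int occurrence;

        PositionedToken(String term, int position, int occurrence) {
            this.term = term;
            this.position = position;
            this.occurrence = occurrence;
        }

        @Override
        public String toString() {
            return term + "@" + position + " (occurrence " + occurrence + ")";
        }
    }

    /**
     * Tokenizes each occurrence on whitespace, one occurrence at a time, and
     * adds positionIncrementGap between consecutive non-empty occurrences.
     */
    static List<PositionedToken> invert(List<String> occurrences, int positionIncrementGap) {
        List<PositionedToken> tokens = new ArrayList<>();
        int position = -1; // the first token of the field lands at position 0
        for (int occ = 0; occ < occurrences.size(); occ++) {
            String value = occurrences.get(occ).trim();
            if (value.isEmpty()) {
                continue; // nothing to analyze in this occurrence
            }
            if (!tokens.isEmpty()) {
                position += positionIncrementGap; // gap applies only after earlier tokens exist
            }
            for (String term : value.split("\\s+")) {
                position++;
                tokens.add(new PositionedToken(term, position, occ));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (PositionedToken t : invert(Arrays.asList("red fox", "lazy dog"), 100)) {
            System.out.println(t);
        }
    }
}
```

In this model, a gap of 100 puts "fox" at position 1 and "lazy" at position 102, so the two occurrences stay far apart in position space. The occurrence index stands in for whatever per-occurrence information a custom filter might encode in a payload.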
Peter

On 7/27/07, Peter Keegan <[EMAIL PROTECTED]> wrote:
>
> I have a question about the way fields are analyzed and inverted by the
> index writer. Currently, if a field has multiple occurrences in a document,
> each occurrence is analyzed separately (see DocumentsWriter.processField).
> Is it safe to assume that this behavior won't change in the future? The
> reason I ask is that my custom analyzer's 'tokenStream' method creates a
> custom filter which produces a payload based on the existence of each field
> occurrence. However, if DocumentsWriter were changed to combine all the
> occurrences before inversion, my scheme wouldn't work. Since payloads are
> created by filters/tokenizers, it helps to keep things flexible.
>
> Thanks,
> Peter
>
>
> On 7/12/07, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> >
> > On Jul 12, 2007, at 6:12 PM, Chris Hostetter wrote:
> >
> > > Hmm... okay so the issue is that in order to get the payload data, you
> > > have to have a TermPositions instance.
> > >
> > > instead of adding getPayload methods to the Spans class (which as Paul
> > > points out, can have nesting issues) perhaps more general solutions
> > > would be:
> > >
> > > a) a more high level getPayload API that lets you get a payload
> > > arbitrarily for a doc/position (perhaps as part of the TermDocs
> > > API?) ... then for Spans you could use this new API with
> > > Spans.start() and Spans.end(). (and all the positions in between)
> >
> > Not sure I follow this. I don't see the fit w/ TermDocs.
> >
> > > b) add a variation of the TermPositions class to allow people to
> > > iterate through the terms of a TermDoc in position order
> > > (TermPositions first iterates over the Terms and then over the
> > > positions) ... then you could seek(span.start()) to get the
> > > Payload data
> > >
> > > c) add methods to the Spans API to get the subspans (if any) ... this
> > > would be the Spans corollary to getTerms() and would always return
> > > TermSpans which would have TermPositions for getting payload data.
> >
> > This could be a good alternative.
> >
> > When we first talked about payloads we wondered if we could just make
> > all Queries into SpanQueries by passing TermPositions instead of term
> > docs, but in the end decided not to do it because of performance
> > issues (some of which are lessened by lazy loading of TermPositions).
> >
> > The thing is, I think, that the Spans is already moving you along in
> > the term positions, so it just seems like a natural fit to have it
> > there, even if there is nesting. It doesn't seem like it would be
> > that hard to then return back the nesting stuff b/c you are just
> > collating the results from the underlying SpanTermQuery. Having said
> > that, I haven't looked into the actual code, so take that w/ a grain
> > of salt.
> >
> > I will try to do some more investigation, as others are welcome to
> > do. Perhaps we should move this to dev?
> >
> > Cheers,
> > Grant
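
Option (c) in the quoted discussion, together with the remark about collating results from the underlying term-level spans, can be pictured with a small toy model. This is only a sketch under stated assumptions: `SpanNode`, `TermSpan`, `NearSpan`, and `collectTermSpans` are hypothetical names invented for the example, not Lucene classes or methods, and `end()` is assumed to be one past the last matched position.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: hypothetical types, not the Lucene Spans API.
final class SubSpanSketch {

    /** A matched span that can report the term-level spans it was built from. */
    interface SpanNode {
        int start();
        int end(); // assumed exclusive: one past the last matched position
        void collectTermSpans(List<TermSpan> out);
    }

    /** Leaf span: a single term at a single position, carrying its payload. */
    static final class TermSpan implements SpanNode {
        final String term;
        final int position;
        final byte[] payload;

        TermSpan(String term, int position, byte[] payload) {
            this.term = term;
            this.position = position;
            this.payload = payload;
        }

        public int start() { return position; }
        public int end() { return position + 1; }
        public void collectTermSpans(List<TermSpan> out) { out.add(this); }
    }

    /** Composite span over one or more children, which may themselves be nested. */
    static final class NearSpan implements SpanNode {
        final SpanNode[] children; // assumed non-empty

        NearSpan(SpanNode... children) {
            this.children = children;
        }

        public int start() {
            int min = Integer.MAX_VALUE;
            for (SpanNode c : children) min = Math.min(min, c.start());
            return min;
        }

        public int end() {
            int max = Integer.MIN_VALUE;
            for (SpanNode c : children) max = Math.max(max, c.end());
            return max;
        }

        public void collectTermSpans(List<TermSpan> out) {
            for (SpanNode c : children) c.collectTermSpans(out); // recurse through nesting
        }
    }

    /** Collates the payloads of every leaf under the given span, in traversal order. */
    static List<byte[]> payloads(SpanNode span) {
        List<TermSpan> leaves = new ArrayList<>();
        span.collectTermSpans(leaves);
        List<byte[]> result = new ArrayList<>();
        for (TermSpan leaf : leaves) result.add(leaf.payload);
        return result;
    }
}
```

In this model nesting is handled by recursion alone: a caller holding the outermost span gets every leaf's payload without knowing how deeply the spans are nested.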
