On Jun 2, 2006, at 6:48 AM, Grant Ingersoll wrote:
> I thought it was you, but wasn't sure.

I'm always looking for ways to minimize Term Vectors, because I consider excerpting/highlighting a core feature rather than an add-on, and they seem like such overkill. It bothers me that they duplicate so much information.

I've been toying with the idea of a hitCollector.collect(int docNum, float score, ScorePositions[] scorePositions) method -- or, more likely, a hitCollector.collect(Scorer scorer) method -- that would preserve each position that contributed to the score of a document and how much it contributed, allowing that information to be passed through a Hit object to the Highlighter.

That might be complemented by storing the startOffsets and endOffsets for each field as streams of delta-encoded VInts along with the stored field data. Conceptually, it would be even cleaner to keep startOffsets and endOffsets in the postings...

a. <doc>+

b. <doc, boost>+

c. <doc, freq, <position>+ >+

d. <doc, freq, <position, boost>+ >+

e. <doc, freq, <position, boost, startOffset, endOffset>+ >+

... and pass *everything* the Highlighter needs to the Hit object. However, the offsets are never needed for scoring.
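
To make the delta-encoded VInt idea concrete, here is a minimal, self-contained sketch. It is not Lucene code: the class name OffsetStream and its methods are made up for illustration, and it assumes the usual VInt layout (7 data bits per byte, low-order bits first, high bit set on every byte except the last) and offsets stored in ascending order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: stores ascending offsets as
// a stream of delta-encoded VInts.
public class OffsetStream {

    // Write a non-negative int as a VInt: 7 data bits per byte, low-order
    // bits first, high bit set on every byte except the last.
    public static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Read one VInt back from the stream.
    public static int readVInt(ByteArrayInputStream in) {
        int b = in.read();
        int value = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = in.read();
            value |= (b & 0x7F) << shift;
        }
        return value;
    }

    // Encode each offset as its difference from the previous offset.
    public static byte[] encode(int[] offsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int offset : offsets) {
            writeVInt(out, offset - previous);
            previous = offset;
        }
        return out.toByteArray();
    }

    // Rebuild the absolute offsets by accumulating the deltas.
    public static int[] decode(byte[] bytes, int count) {
        ByteArrayInputStream in = new ByteArrayInputStream(bytes);
        int[] offsets = new int[count];
        int previous = 0;
        for (int i = 0; i < count; i++) {
            previous += readVInt(in);
            offsets[i] = previous;
        }
        return offsets;
    }

    public static void main(String[] args) {
        int[] startOffsets = {0, 4, 10, 300};
        byte[] encoded = encode(startOffsets);
        System.out.println(encoded.length + " bytes for "
            + startOffsets.length + " offsets");
        System.out.println(Arrays.toString(decode(encoded, startOffsets.length)));
    }
}
```

Because neighboring tokens sit close together, most deltas fit in a single byte, which is why this layout stays small next to full Term Vectors. The endOffsets could go in a second stream of the same shape.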

> I would also like a way to store the frequency of the term in the overall collection (probably should go in the Term dictionary, but not sure, at the cost of an additional VInt per term, but I am open to other places to store it). Right now, in order to calculate this, one has to either store it separately at indexing time (using a term counting Filter) or calculate it at runtime by looping over the TermDocs and summing.

Sure, makes sense to me. Sounds like a custom codec you'd define. (The following code has been swiped and adapted from TermBuffer...)

public class CollFreqCodec extends TermDictionaryCodec {
  private int collFreq;   // frequency of the term across the whole collection

  public void readRecord (IndexInput input, FieldInfos fieldInfos)
    throws IOException {
    this.term = null;                           // invalidate cache
    int start = input.readVInt();               // length of prefix shared with previous term
    int length = input.readVInt();              // length of the new suffix
    int totalLength = start + length;
    setBytesLength(totalLength);
    input.readBytes(this.bytes, start, length); // append suffix after the shared prefix
    this.field = fieldInfos.fieldName(input.readVInt());
    this.collFreq = input.readVInt();           // the one extra VInt per term
  }
}

That's not quite right, because I'm envisioning a codec rather than a TermBuffer subclass, but maybe you get the idea.
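
For comparison, the run-time alternative mentioned above (looping over a term's postings and summing the per-document frequencies) looks roughly like the sketch below. The Posting class and CollFreqSum are hypothetical stand-ins so that the example is self-contained. In Lucene itself the loop would presumably walk a TermDocs enumeration and add up freq() for each document.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Lucene code: sums per-document term
// frequencies to get the collection-wide frequency of one term.
public class CollFreqSum {

    // Stand-in for one <doc, freq> entry in a term's postings.
    public static final class Posting {
        public final int doc;
        public final int freq;

        public Posting(int doc, int freq) {
            this.doc = doc;
            this.freq = freq;
        }
    }

    // One pass over every posting for the term: cost grows with the
    // number of documents that contain it.
    public static long collectionFreq(Iterable<Posting> postings) {
        long total = 0;
        for (Posting posting : postings) {
            total += posting.freq;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Posting> postings = Arrays.asList(
            new Posting(0, 3), new Posting(5, 1), new Posting(9, 4));
        System.out.println(collectionFreq(postings));
    }
}
```

That pass over the postings is the cost that an extra VInt in the term dictionary would trade away: one small read per term instead of a scan of every document containing it.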

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

