On Jun 2, 2006, at 6:48 AM, Grant Ingersoll wrote:
I thought it was you, but wasn't sure.
I'm always looking for ways to minimize Term Vectors, because I
consider excerpting/highlighting a core feature rather than an add-
on, and they seem like such overkill. It bothers me that they
duplicate so much information.
I've been toying with the idea of a hitCollector.collect(int docNum,
float score, ScorePositions[] scorePositions) method -- or, more
likely, a hitCollector.collect(Scorer scorer) method -- that would
preserve each position that contributed to the score of a document
and how much it contributed, allowing that information to be passed
through a Hit object to the Highlighter.
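
To make the three-argument variant a little more concrete, here is a minimal sketch. ScorePosition, PositionHit, and PositionCollector are hypothetical names made up for illustration, not existing Lucene classes, and the idea that a scorer could report a per-position contribution is itself an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical: one position that contributed to a document's score.
final class ScorePosition {
    final int position;        // token position within the field
    final float contribution;  // share of the score attributed to this position

    ScorePosition(int position, float contribution) {
        this.position = position;
        this.contribution = contribution;
    }
}

// Hypothetical: a hit that carries its scoring positions along for the highlighter.
final class PositionHit {
    final int docNum;
    final float score;
    final ScorePosition[] scorePositions;

    PositionHit(int docNum, float score, ScorePosition[] scorePositions) {
        this.docNum = docNum;
        this.score = score;
        this.scorePositions = scorePositions;
    }
}

// Hypothetical collector with the three-argument collect() described above.
final class PositionCollector {
    final List<PositionHit> hits = new ArrayList<PositionHit>();

    void collect(int docNum, float score, ScorePosition[] scorePositions) {
        // Keep the positions with the hit instead of discarding them after scoring.
        hits.add(new PositionHit(docNum, score, scorePositions));
    }
}
```

A highlighter could then walk hit.scorePositions and favor the excerpt window with the largest summed contribution, without re-analyzing the field.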
That might be complemented by storing the startOffsets and endOffsets
for each field as streams of delta-encoded VInts along with the
stored field data. Conceptually, it would be even cleaner to keep
startOffsets and endOffsets in the postings...
a. <doc>+
b. <doc, boost>+
c. <doc, freq, <position>+ >+
d. <doc, freq, <position, boost>+ >+
e. <doc, freq, <position, boost, startOffset, endOffset>+ >+
... and pass *everything* the Highlighter needs to the Hit object.
However, the offsets are never needed for scoring.
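
As a rough illustration of the "streams of delta-encoded VInts" idea, the sketch below packs an ascending list of offsets as gaps, each written as a variable-length int with 7 data bits per byte and the high bit as a continuation flag. OffsetDeltaCodec and its methods are hypothetical helpers, and the sketch assumes the offsets are non-negative and sorted in ascending order.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of Lucene: packs ascending offsets
// as delta-encoded variable-length ints.
final class OffsetDeltaCodec {

    // Write one non-negative int, 7 data bits per byte, low-order bits first.
    // The high bit of each byte signals that more bytes follow.
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    // Encode ascending offsets as the gaps between consecutive values.
    static byte[] encode(int[] offsets) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int offset : offsets) {
            writeVInt(out, offset - previous);
            previous = offset;
        }
        return out.toByteArray();
    }

    // Decode `count` offsets by reading each gap and summing the gaps back up.
    static int[] decode(byte[] bytes, int count) {
        int[] offsets = new int[count];
        int pos = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int b = bytes[pos++];
            int delta = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = bytes[pos++];
                delta |= (b & 0x7F) << shift;
            }
            previous += delta;
            offsets[i] = previous;
        }
        return offsets;
    }
}
```

Because the gaps between neighboring offsets are usually small, most of them fit in a single byte, which is the point of delta-encoding before writing VInts.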
I would also like a way to store the frequency of the term in the
overall collection. It should probably go in the Term dictionary, at
the cost of an additional VInt per term, but I'm not sure and am open
to other places to store it. Right now, in order to calculate
this, one has to either store it separately at indexing time (using
a term counting Filter) or calculate it at runtime by looping over
the TermDocs and summing.
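
For reference, the runtime calculation described above amounts to the loop sketched here. TermDocsLike is a stand-in declaring only the two methods the loop needs, so that the example is self-contained. In Lucene the equivalent calls are TermDocs.next() and TermDocs.freq(), on a TermDocs obtained from an IndexReader. CollectionFrequency and fromFreqs are hypothetical names used only for this demonstration.

```java
// Stand-in for the two TermDocs methods the loop relies on.
interface TermDocsLike {
    boolean next();  // advance to the next document containing the term
    int freq();      // occurrences of the term in the current document
}

final class CollectionFrequency {

    // Sum the within-document frequencies over every document in the postings.
    static long sum(TermDocsLike termDocs) {
        long total = 0;
        while (termDocs.next()) {
            total += termDocs.freq();
        }
        return total;
    }

    // In-memory postings for demonstration: one frequency per matching document.
    static TermDocsLike fromFreqs(final int[] freqs) {
        return new TermDocsLike() {
            private int i = -1;
            public boolean next() { return ++i < freqs.length; }
            public int freq() { return freqs[i]; }
        };
    }
}
```

The cost is one pass over the term's entire postings list per lookup, which is what a stored per-term value would avoid.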
Sure, makes sense to me. Sounds like a custom codec you'd define.
(The following code has been swiped and adapted from TermBuffer...)
public class CollFreqCodec extends TermDictionaryCodec {
    // Occurrences of the term across the whole collection.
    private int collFreq;

    public void readRecord(IndexInput input, FieldInfos fieldInfos)
            throws IOException {
        this.term = null; // invalidate cache
        int start = input.readVInt();  // length of prefix shared with the previous term
        int length = input.readVInt(); // number of new bytes that follow
        int totalLength = start + length;
        setBytesLength(totalLength);
        input.readBytes(this.bytes, start, length);
        this.field = fieldInfos.fieldName(input.readVInt());
        this.collFreq = input.readVInt(); // the additional VInt per term
    }
}
That's not quite right, because I'm envisioning a codec rather than a
TermBuffer subclass, but maybe you get the idea.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/