"Karl Koch" <[EMAIL PROTECTED]> writes:

> I am not sure if I know exactly what pivoted normalisation is. I can tell
> you what I do; in the meantime I will have a look at your paper, and I hope
> that we can discuss this issue further.

Short answer on pivoted document length normalization.  You'll notice
that the Lucene scoring function includes a normalization for document
length.  This is because, in general, just using tf and idf will
result in a bias towards long documents, which contain more terms.
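To make the bias concrete, here is a toy calculation (illustrative only, not
Lucene internals): the longer document wins on raw tf*idf simply because it
repeats the query term more often, even though the term is proportionally
rarer in it.

    // Toy example: raw tf*idf rewards sheer document length.
    public class TfIdfBias {
        public static void main(String[] args) {
            double idf = Math.log(1000.0 / 50.0); // term appears in 50 of 1000 docs

            int tfShort = 3;   //  3 occurrences in a    100-word document (3%)
            int tfLong  = 30;  // 30 occurrences in a 10,000-word document (0.3%)

            System.out.println("short doc: " + tfShort * idf);
            System.out.println("long  doc: " + tfLong  * idf);
            // Without length normalization the long document scores 10x higher.
        }
    }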

The standard cosine normalization controls for this, but it
overcorrects... if you plot the probability of retrieval and the
probability of relevance against document length (using a test
collection), you can see that cosine is too biased towards short documents.
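For reference, cosine normalization just divides each document's term
weights by the Euclidean length of its weight vector, so every document is
scored as a unit vector (a sketch of the idea, not Lucene's implementation):

    // Cosine normalization: scale weights by the vector's Euclidean length.
    static double[] cosineNormalize(double[] weights) {
        double sumSquares = 0.0;
        for (double w : weights) {
            sumSquares += w * w;
        }
        double length = Math.sqrt(sumSquares);
        double[] normalized = new double[weights.length];
        for (int i = 0; i < weights.length; i++) {
            normalized[i] = weights[i] / length;
        }
        return normalized;
    }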

Pivoted normalization learns a scaling factor to correct for this.  In
the original formulation (Singhal et al., SIGIR '96) the length was
based on the words in the document, but later the byte length was used.
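The formula itself is simple: the old normalization factor is tilted around
a pivot (usually the average factor over the collection) by a slope fit on a
test collection.  A rough sketch, with parameter names of my own choosing
and no recommended values:

    // Pivoted length normalization (after Singhal et al., SIGIR '96):
    // documents at the pivot keep their old normalization; shorter ones
    // are normalized a bit more aggressively, longer ones a bit less.
    static double pivotedNorm(double oldNorm, double pivot, double slope) {
        return (1.0 - slope) * pivot + slope * oldNorm;
    }

    // Score a term by dividing the raw tf*idf weight by the pivoted factor.
    // For the byte-length variant, oldNorm is the document's byte length
    // and pivot the average byte length in the collection.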

Occasionally, you will see people blindly using pdl (pivoted document
length) constants from some paper on their own collection without actually
trying to measure what they should be.  This is likely to screw things up.

Ian


