Allow TfIdfSimilarity to encode norms in more than a byte

Shai Erera Tue, 25 Jun 2013 05:55:39 -0700

Hi

Today TfIdfSimilarity forces the encoding of norms into a single byte, and
there's no way to override it. E.g. if I don't want to lose precision, the
only thing I can do is write a different Similarity while copying most of
the code from TfIdfSimilarity.


I looked at the code and was wondering whether we could move this byte
encoding to DefaultSimilarity, and make encodeNorm and decodeNorm
take/return long values, since Sim.computeNorm ultimately returns a long
value anyway.

That way, DefaultSim would document the byte encoding it uses (moving the
respective jdocs from TfIdfSim to Default), while others could extend
TfIdfSim without losing that much precision and without copying the
entire class.
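
To make the idea concrete, here is a rough, self-contained sketch of the
split. It does not use Lucene at all: the class names (NormCodecSketch,
ByteNormCodec, FullPrecisionNormCodec) are made up for illustration, and the
simple 1/255 quantization is only a stand-in for whatever byte encoding
DefaultSim would keep and document.

```java
// Hypothetical sketch only: these are NOT Lucene classes or signatures.
// The base class stands in for TfIdfSim with long-valued norm hooks.
abstract class NormCodecSketch {
    /** Encode a float norm into the long that would be stored per document. */
    abstract long encodeNorm(float norm);

    /** Decode a stored long back into a float norm. */
    abstract float decodeNorm(long stored);
}

/** Stand-in for DefaultSim: keeps only one byte (0..255), so it is lossy. */
final class ByteNormCodec extends NormCodecSketch {
    @Override
    long encodeNorm(float norm) {
        // Assumption for this sketch: norms are clamped to [0, 1].
        float clamped = Math.max(0f, Math.min(1f, norm));
        return Math.round(clamped * 255f);
    }

    @Override
    float decodeNorm(long stored) {
        return stored / 255f;
    }
}

/** An extension that keeps the full float by storing its raw bits in the long. */
final class FullPrecisionNormCodec extends NormCodecSketch {
    @Override
    long encodeNorm(float norm) {
        return Float.floatToIntBits(norm);
    }

    @Override
    float decodeNorm(long stored) {
        return Float.intBitsToFloat((int) stored);
    }
}
```

With hooks shaped like this, the precision trade-off becomes a choice of the
subclass rather than something fixed in the base class.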

While we're at it, I read LUCENE-1907 about queryNorm and I think this is
important too. So whether we introduce a queryNorm or make computeWeight
non-final, to allow returning other SimWeight impls, I think we should
allow that extension.

I was about to open a JIRA issue to handle that until I noticed the
explicit documentation in TfIdfSim (re the byte encoding), so thought I'd
ask first if anybody objects.

This will break back-compat (the encode/decodeNorm signatures change), but
I think that's OK in this case. Overriding TfIdfSim is expert enough IMO.

Shai
