Re: Distinct terms within a document for new Similarity class

Romaric Pighetti Tue, 21 May 2019 06:28:45 -0700

Hi,

Thanks Adrien for the quick and accurate answer.

Digging into the implementation, I saw that the document length is already stored there, and as I need both the unique term count and the length, I can't just replace one with the other. The Similarity class documentation states that it is possible to store additional values using NumericDocValuesField that could be accessed at query time using a LeafReader.

From my understanding, using the LeafReaderContext when building the SimScorer should allow me to get access to the NumericDocValuesField.

The problem is I don't get how to create and store a new NumericDocValuesField from the Similarity. My guess is that it should happen within the computeNorm function again, as it is the only function called at indexing time. However, I am unable to understand how to create and store this information from that function.


If you have any advice that would be really helpful.

Thanks.
Romaric

On 20/05/2019 at 12:16, Adrien Grand wrote:

Hi Romaric,

You are right, computeNorm is the right place to compute and record
the number of unique terms of a document. Your computeNorm function
would look something like this:

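// Needs org.apache.lucene.index.FieldInvertState and org.apache.lucene.util.SmallFloat.
@Override
public final long computeNorm(FieldInvertState state) {
   // Encode the number of distinct terms of the field on a single byte (lossy).
   return SmallFloat.intToByte4(state.getUniqueTermCount());
}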

And then in your scorer you could convert the norm back to the unique
term count by doing SmallFloat.byte4ToInt on it. The SmallFloat
methods are useful to encode this count on one byte, which trades some
accuracy but is usually the right trade-off.
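
To make the accuracy trade-off concrete, here is a minimal standalone sketch of a lossy one-byte encoding in the same spirit. It is not Lucene's SmallFloat code: the class name LossyByteCodec and its encode/decode methods are hypothetical, and the bit layout (a 5-bit shift plus 3 explicit mantissa bits) is an assumption chosen for illustration only.

```java
// Illustrative stand-in only; NOT the actual org.apache.lucene.util.SmallFloat code.
final class LossyByteCodec {

    private LossyByteCodec() {}

    // Encodes a non-negative int on one byte, keeping its 4 most significant bits.
    static byte encode(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("Only non-negative values are supported, got " + value);
        }
        int numBits = 32 - Integer.numberOfLeadingZeros(value);
        if (numBits < 4) {
            // Values below 8 fit in the 3 mantissa bits and are stored exactly.
            return (byte) value;
        }
        int shift = numBits - 4;
        // Keep the top 4 bits, then drop the leading one because it is implicit.
        int encoded = (value >>> shift) & 0x07;
        // Store shift + 1 in the upper 5 bits; 0 is reserved for the exact small values.
        encoded |= (shift + 1) << 3;
        return (byte) encoded;
    }

    // Decodes a byte produced by encode; the result is less than or equal to the original value.
    static int decode(byte b) {
        int v = b & 0xFF;
        int bits = v & 0x07;
        int shift = (v >>> 3) - 1;
        if (shift == -1) {
            return bits;
        }
        // Restore the implicit leading bit and shift back into place.
        return (bits | 0x08) << shift;
    }
}
```

With this layout, counts up to 15 survive the round trip unchanged, while larger counts are rounded down to their 4 leading bits: LossyByteCodec.decode(LossyByteCodec.encode(1000)) gives 960. A scorer that reads the norm back therefore sees an approximation of the unique term count, which is the trade-off mentioned above.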

On Mon, May 20, 2019 at 12:05 PM Romaric Pighetti
<romaric.pighe...@francelabs.com> wrote:

Hi,

I am currently implementing a new similarity class in Lucene which is
based on a language model with absolute discount.
I am basing my work on the work already done in the
LMDirichletSimilarity and LMJelinekMercerSimilarity which are really close.
However to end my implementation I need to get the number of unique
terms present in the document, and this information seems to be
unavailable natively from within the score function.

The computeNorm function which is in the Similarity class seems to be
the right place to compute (or read) and store this statistic but I am
not sure.
So I am reaching out to find out whether I am on the right track and
whether you have any advice on how I could access this statistic from
the computeNorm function, if possible?

I would like the implementation to be as clean as possible with regard
to Lucene's coding expectations, so that I can submit it for integration
once it is done.

Thanks for your help,
Regards.

--
Romaric Pighetti
R&D - FranceLabs


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org

