I've posted the patch to the Lucene.Net JIRA under issue LUCENENET-337 [https://issues.apache.org/jira/browse/LUCENENET-337]. If you have any questions/issues/concerns, please keep the discussion on the JIRA issue.
While the patch prevents the undesired increase in the length norm during synonym injection, there are other things you can do with Payloads to further adjust the scoring. If you want the ability to score a match against a synonym differently from a match against the original text, you can add a payload marking the term as either original text or a synonym and create your own PayloadFunction to score the document accordingly (a rough sketch of the indexing side appears at the end of this thread). I've not tried this yet - it's on my list of things to experiment with.

Michael

-----Original Message-----
From: Michael Garski
Sent: Saturday, January 16, 2010 7:53 PM
To: [email protected]
Subject: Re: synonyms

Artem,

I made the changes for 2.9 and can create a patch that can be applied against the trunk. I'll create a JIRA issue and post the patch along with some sample code on how to use it when I am back in the office next on Tuesday. I don't see this patch being committed to the trunk, as it does alter the internal behavior slightly, but rather sitting in the contrib section.

Michael

On Jan 16, 2010, at 6:34 PM, "Artem Chereisky" <[email protected]> wrote:

> now sending to lucene.apache.org
>
> ---------- Forwarded message ----------
> From: Artem Chereisky <[email protected]>
> Date: Sun, Jan 17, 2010 at 1:30 PM
> Subject: Re: synonyms
> To: [email protected]
> Cc: [email protected]
>
> Hi Michael,
>
> I refer to a thread between the two of us about a month and a half ago,
> when you helped me with lengthNorm for synonyms. It required a change to
> the Lucene core as per below. I implemented the change and it worked
> great for me.
>
> I'm now in the middle of moving to 2.9 and that particular change got me
> stuck. I'm wondering if you had to deal with the same issue and, if so,
> how did you manage it?
>
> Here's the issue: in 2.4 there was this line of code:
>
>     Token token = perThread.localToken.Reinit(stringValue, fieldState.offset,
>         fieldState.offset + valueLength);
>
> In 2.9 it has changed to:
>
>     perThread.singleTokenTokenStream.Reinit(stringValue, 0, valueLength);
>
> and it doesn't return a Token. I can't see how I can get hold of the Token
> to implement the same logic further down:
>
>     if (token.IncludeInFieldLength)
>     {
>         fieldState.length++;
>     }
>
> Any help would be appreciated.
>
> Regards,
> Art
>
>
> On Fri, Dec 18, 2009 at 12:11 PM, Artem Chereisky <[email protected]> wrote:
>
>> Thank you, Michael. You've been helpful as always.
>>
>> Art
>> -a
>>
>>
>> On 18/12/2009, at 6:06, Michael Garski <[email protected]> wrote:
>>
>>> Artem,
>>>
>>> Here's a description of the change made to allow for customizing the
>>> length norm when indexing synonyms. I do not have a patch available for
>>> it at this time. While I made the change for 2.4, the same approach
>>> could be taken in 2.9; there may be a better way of implementing it
>>> using Attributes, however I have not yet investigated that approach.
>>>
>>> I added a bool property to Token named 'IncludeInFieldLength' that
>>> defaulted to true; then, in a custom analyzer, if I did not want a
>>> Token to count towards the field length, I would set the value to
>>> false. Within the DocInverterPerField class I altered the internals of
>>> processFields(Fieldable[] fields, int count) to only increment
>>> fieldState.length if the 'IncludeInFieldLength' property on the token
>>> is set to true.
>>>
>>> I made the change to handle the same use case you have - synonym
>>> injection - and it worked great.
>>>
>>> Michael
>>>
>>> -----Original Message-----
>>> From: Artem Chereisky [mailto:[email protected]]
>>> Sent: Wednesday, December 16, 2009 5:30 PM
>>> To: [email protected]
>>> Subject: synonyms
>>>
>>> Hi Everyone,
>>>
>>> I implemented synonyms using the SynonymFilter and SynonymTree classes,
>>> which I ported from Java. The solution supports multi-word synonyms and
>>> it seems to work fine.
>>>
>>> One problem with this approach is that, although the synonyms are at the
>>> same position in the index, each one gets counted towards the total
>>> number of terms, which adversely affects lengthNorm. Michael Garski
>>> mentioned earlier that he came across a similar issue and solved it.
>>> Am I correct, Michael? If so, could you share your approach, please?
>>>
>>> Synonyms are a fairly standard feature. Is there a 'best practice'
>>> solution?
>>>
>>> Regards,
>>> Art
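
A minimal sketch of the analysis-side half of the approach Michael describes in his 18 December message above - injecting synonyms at position increment 0 and flagging them so they are skipped when fieldState.length is incremented - could look like the filter below. It assumes a Lucene.Net 2.4/2.9 core that has been patched with the 'IncludeInFieldLength' property from that message and uses the classic Token-based API; the class name and the synonym map are illustrative only and are not taken from the LUCENENET-337 patch.

    using System.Collections.Generic;
    using Lucene.Net.Analysis;

    // Illustrative sketch only - not the LUCENENET-337 patch. Assumes Token has
    // been patched with the bool property 'IncludeInFieldLength' described in
    // the thread (classic Token-based API, Lucene.Net 2.4/2.9).
    public class MarkedSynonymFilter : TokenFilter
    {
        private readonly IDictionary<string, string[]> synonyms; // term -> synonyms
        private readonly Queue<Token> pending = new Queue<Token>();

        public MarkedSynonymFilter(TokenStream input, IDictionary<string, string[]> synonyms)
            : base(input)
        {
            this.synonyms = synonyms;
        }

        public override Token Next()
        {
            // Emit any synonyms queued up behind the previous original token first.
            if (pending.Count > 0)
                return pending.Dequeue();

            Token original = input.Next();
            if (original == null)
                return null;

            string[] alternatives;
            if (synonyms.TryGetValue(original.TermText(), out alternatives))
            {
                foreach (string alt in alternatives)
                {
                    Token synonym = new Token(alt, original.StartOffset(), original.EndOffset());
                    synonym.SetPositionIncrement(0);      // stack on the original term's position
                    synonym.IncludeInFieldLength = false; // patched flag: skip in fieldState.length
                    pending.Enqueue(synonym);
                }
            }
            return original;
        }
    }

With the patched core, the guard Artem quotes above - if (token.IncludeInFieldLength) { fieldState.length++; } - is what keeps the stacked synonym tokens from inflating the field length and hence the lengthNorm.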

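
The payload idea from the top of this thread is only described in prose. As a rough illustration, assuming the same Token-based API and the Lucene.Net.Index.Payload class (the class name and the marker bytes below are made up for the example), a filter could tag each term with a one-byte payload distinguishing original text from injected synonyms:

    using Lucene.Net.Analysis;
    using Lucene.Net.Index;

    // Illustrative sketch only. Tags each token with a one-byte payload so that
    // query-time code can tell original terms (0) from injected synonyms (1).
    // A position increment of 0 is taken to mean the token is an injected synonym.
    public class SynonymMarkerFilter : TokenFilter
    {
        public SynonymMarkerFilter(TokenStream input) : base(input)
        {
        }

        public override Token Next()
        {
            Token token = input.Next();
            if (token == null)
                return null;

            bool isSynonym = token.GetPositionIncrement() == 0;
            token.SetPayload(new Payload(new byte[] { isSynonym ? (byte)1 : (byte)0 }));
            return token;
        }
    }

At query time, a PayloadTermQuery with a custom PayloadFunction, or a Similarity whose ScorePayload looks at the marker byte, could then down-weight matches whose payload marks them as synonyms; the exact class and method signatures differ between 2.4 and 2.9, so check them against your version before relying on this sketch.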