I've posted the patch to the Lucene.Net JIRA under issue LUCENENET-337 [https://issues.apache.org/jira/browse/LUCENENET-337]. If you have any questions/issues/concerns, please keep the discussion on the JIRA issue.
While the patch prevents the undesired increase in the length norm during synonym injection, there are other things you can do with Payloads to further adjust the scoring. If you want the ability to score a match against a synonym differently from a match against the original text, you can add a payload marking the term as either original text or a synonym and create your own PayloadFunction to score the document accordingly (a rough sketch of the indexing side appears at the end of this thread). I've not tried this yet - it's on my list of things to experiment with.

Michael

-----Original Message-----
From: Michael Garski
Sent: Saturday, January 16, 2010 7:53 PM
To: [email protected]
Subject: Re: synonyms

Artem,

I made the changes for 2.9 and can create a patch that can be applied against the trunk. I'll create a JIRA issue and post the patch along with some sample code on how to use it when I am back in the office next on Tuesday. I don't see this patch being committed to the trunk, as it does alter the internal behavior slightly, but rather sitting in the contrib section.

Michael

On Jan 16, 2010, at 6:34 PM, "Artem Chereisky" <[email protected]> wrote:

> now sending to lucene.apache.org
>
> ---------- Forwarded message ----------
> From: Artem Chereisky <[email protected]>
> Date: Sun, Jan 17, 2010 at 1:30 PM
> Subject: Re: synonyms
> To: [email protected]
> Cc: [email protected]
>
> Hi Michael,
>
> I refer to a thread between the two of us about a month and a half ago,
> when you helped me with lengthNorm for synonyms. It required a change to
> the Lucene core as per below. I implemented the change and it worked
> great for me.
>
> I'm now in the middle of moving to 2.9 and that particular change got me
> stuck. I'm wondering if you had to deal with the same issue and, if so,
> how did you manage it?
>
> Here's the issue: in 2.4 there was this line of code:
>
>     Token token = perThread.localToken.Reinit(stringValue, fieldState.offset,
>         fieldState.offset + valueLength);
>
> In 2.9 it has changed to:
>
>     perThread.singleTokenTokenStream.Reinit(stringValue, 0, valueLength);
>
> and it doesn't return a Token. I can't see how I can get hold of the Token
> to implement the same logic further down:
>
>     if (token.IncludeInFieldLength)
>     {
>         fieldState.length++;
>     }
>
> Any help would be appreciated.
>
> Regards,
> Art
>
>
> On Fri, Dec 18, 2009 at 12:11 PM, Artem Chereisky <[email protected]> wrote:
>
>> Thank you, Michael. You've been helpful as always.
>>
>> Art
>> -a
>>
>>
>> On 18/12/2009, at 6:06, Michael Garski <[email protected]> wrote:
>>
>>> Artem,
>>>
>>> Here's a description of the change made to allow for customizing the
>>> length norm when indexing synonyms. I do not have a patch available for
>>> it at this time. While I made the change for 2.4, the same approach
>>> could be taken in 2.9; there may be a better way of implementing it
>>> using Attributes, however I have not yet investigated that approach.
>>>
>>> I added a bool property to Token named 'IncludeInFieldLength' that
>>> defaulted to true; then, in a custom analyzer, if I did not want a
>>> Token to count towards the field length, I would set the value to
>>> false. Within the DocInverterPerField class I altered the internals of
>>> processFields(Fieldable[] fields, int count) to only increment
>>> fieldState.length if the 'IncludeInFieldLength' property on the token
>>> is set to true.
>>>
>>> I made the change to handle the same use case you have - synonym
>>> injection - and it worked great.
>>>
>>> Michael
>>>
>>> -----Original Message-----
>>> From: Artem Chereisky [mailto:[email protected]]
>>> Sent: Wednesday, December 16, 2009 5:30 PM
>>> To: [email protected]
>>> Subject: synonyms
>>>
>>> Hi Everyone,
>>>
>>> I implemented synonyms using the SynonymFilter and SynonymTree classes,
>>> which I ported from Java. The solution supports multi-word synonyms and
>>> it seems to work fine.
>>>
>>> One problem with this approach is that, although the synonyms are at the
>>> same position in the index, each one gets counted towards the total
>>> number of terms, which adversely affects lengthNorm. Michael Garski
>>> mentioned earlier that he came across a similar issue and solved it.
>>> Am I correct, Michael? If so, could you share your approach, please?
>>>
>>> Synonyms are a fairly standard feature. Is there a 'best practice'
>>> solution?
>>>
>>> Regards,
>>> Art
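
A minimal sketch of the analysis-side half of the approach Michael describes in his 18 December message above - injecting synonyms at position increment 0 and flagging them so they are skipped when fieldState.length is incremented - could look like the filter below. It assumes a Lucene.Net 2.4/2.9 core that has been patched with the 'IncludeInFieldLength' property from that message and uses the classic Token-based API; the class name and the synonym map are illustrative only and are not taken from the LUCENENET-337 patch.

    using System.Collections.Generic;
    using Lucene.Net.Analysis;

    // Illustrative sketch only - not the LUCENENET-337 patch. Assumes Token has
    // been patched with the bool property 'IncludeInFieldLength' described in
    // the thread (classic Token-based API, Lucene.Net 2.4/2.9).
    public class MarkedSynonymFilter : TokenFilter
    {
        private readonly IDictionary<string, string[]> synonyms; // term -> synonyms
        private readonly Queue<Token> pending = new Queue<Token>();

        public MarkedSynonymFilter(TokenStream input, IDictionary<string, string[]> synonyms)
            : base(input)
        {
            this.synonyms = synonyms;
        }

        public override Token Next()
        {
            // Emit any synonyms queued up behind the previous original token first.
            if (pending.Count > 0)
                return pending.Dequeue();

            Token original = input.Next();
            if (original == null)
                return null;

            string[] alternatives;
            if (synonyms.TryGetValue(original.TermText(), out alternatives))
            {
                foreach (string alt in alternatives)
                {
                    Token synonym = new Token(alt, original.StartOffset(), original.EndOffset());
                    synonym.SetPositionIncrement(0);      // stack on the original term's position
                    synonym.IncludeInFieldLength = false; // patched flag: skip in fieldState.length
                    pending.Enqueue(synonym);
                }
            }
            return original;
        }
    }

With the patched core, the guard Artem quotes above - if (token.IncludeInFieldLength) { fieldState.length++; } - is what keeps the stacked synonym tokens from inflating the field length and hence the lengthNorm.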

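
The payload idea from the top of this thread is only described in prose. As a rough illustration, assuming the same Token-based API and the Lucene.Net.Index.Payload class (the class name and the marker bytes below are made up for the example), a filter could tag each term with a one-byte payload distinguishing original text from injected synonyms:

    using Lucene.Net.Analysis;
    using Lucene.Net.Index;

    // Illustrative sketch only. Tags each token with a one-byte payload so that
    // query-time code can tell original terms (0) from injected synonyms (1).
    // A position increment of 0 is taken to mean the token is an injected synonym.
    public class SynonymMarkerFilter : TokenFilter
    {
        public SynonymMarkerFilter(TokenStream input) : base(input)
        {
        }

        public override Token Next()
        {
            Token token = input.Next();
            if (token == null)
                return null;

            bool isSynonym = token.GetPositionIncrement() == 0;
            token.SetPayload(new Payload(new byte[] { isSynonym ? (byte)1 : (byte)0 }));
            return token;
        }
    }

At query time, a PayloadTermQuery with a custom PayloadFunction, or a Similarity whose ScorePayload looks at the marker byte, could then down-weight matches whose payload marks them as synonyms; the exact class and method signatures differ between 2.4 and 2.9, so check them against your version before relying on this sketch.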