[jira] Commented: (LUCENE-1420) Similarity.lengthNorm and positionIncrement=0

Hoss Man (JIRA) Tue, 14 Oct 2008 17:29:45 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639668#action_12639668 ]


Hoss Man commented on LUCENE-1420:
----------------------------------

1) I only skimmed this quickly, but I don't think the changes to 
SweetSpotSimilarity are backward compatible ... setLengthNormFactors has a new 
arg list.

2) ditto for the public "Info" constructor in MemoryIndex.java

3) as long as we are adding a new lengthNorm method that has access to new data 
about the stream, would it also make sense to pass in fieldState.position?  
and/or a new count of the number of times 
getPositionIncrementGap(fieldInfo.name) is called?  Those also seem like they 
could be useful, and should be just as cheap to keep track of as numOverlap and 
length.  (this occurred to me because of recent threads on solr-user asking 
about lengthNorm and multivalued fields ... there may only be one fieldNorm per 
field name, but with stats like that we could at least do some interesting 
things based on the average length of each field value.)

4) independent of #3, we may want to consider making FieldInvertState a public 
class and passing it directly to lengthNorm ... that way lengthNorm can utilize 
whatever data it wants, and we can add more available data later without 
changing the API again.  We could even deprecate lengthNorm entirely and add a 
new FieldInvertState.norm property that a new 
Similarity.computeNorm(FieldInvertState) could set directly so it could choose 
to ignore the doc & field boosts altogether if it wanted to.
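
To make #3 and #4 a bit more concrete, here is a rough sketch, not the patch 
itself.  FieldInvertStateSketch, NormSketch, and the gapCount field are 
hypothetical names invented for illustration; the only thing taken from 
existing behavior is the classic 1/sqrt(numTerms) length norm multiplied by the 
boost.  It also assumes the position increment gap is applied once between 
consecutive values, so the number of values is gapCount + 1.

```java
// Hypothetical stand-in for a public FieldInvertState. Only length,
// numOverlap and position are named in the discussion; gapCount and boost
// are assumptions made for this sketch.
final class FieldInvertStateSketch {
    final String fieldName;
    final int length;      // total number of tokens seen for this field name
    final int numOverlap;  // tokens with positionIncrement == 0
    final int position;    // position of the last token
    final int gapCount;    // hypothetical: times the position increment gap was applied
    final float boost;     // combined doc and field boost

    FieldInvertStateSketch(String fieldName, int length, int numOverlap,
                           int position, int gapCount, float boost) {
        this.fieldName = fieldName;
        this.length = length;
        this.numOverlap = numOverlap;
        this.position = position;
        this.gapCount = gapCount;
        this.boost = boost;
    }
}

// Hypothetical computeNorm-style methods that receive the whole state object.
final class NormSketch {
    /** Classic behavior: boost * 1/sqrt(total token count). */
    static float classicNorm(FieldInvertStateSketch s) {
        return s.boost * invSqrt(s.length);
    }

    /**
     * Alternative: norm from the average number of non-overlapping tokens
     * per field value, deliberately ignoring the doc and field boosts.
     */
    static float averageValueNorm(FieldInvertStateSketch s) {
        int tokens = s.length - s.numOverlap;
        int values = s.gapCount + 1; // assumes one gap between consecutive values
        return invSqrt((float) tokens / values);
    }

    // Clamps to 1 so an empty field does not produce an infinite norm;
    // that clamp is a choice made for this sketch.
    private static float invSqrt(float n) {
        return (float) (1.0 / Math.sqrt(Math.max(1f, n)));
    }
}
```

Because the method receives the whole state object, a later addition such as 
gapCount would not change any method signature, which is the point of #4.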

> Similarity.lengthNorm and positionIncrement=0
> ---------------------------------------------
>
>                 Key: LUCENE-1420
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1420
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.3.3, 2.9
>            Reporter: Andrzej Bialecki 
>            Assignee: Michael McCandless
>             Fix For: 2.3.3, 2.9
>
>         Attachments: similarity.patch
>
>
> Calculation of lengthNorm factor should in some cases take into account the 
> number of tokens with positionIncrement=0. This should be made optional, to 
> support two different scenarios:
> * when analyzers insert artificially constructed tokens into TokenStream 
> (e.g. ASCII-fied versions of accented terms, stemmed terms), and it's 
> unlikely that users submit queries containing both versions of tokens: in 
> this case lengthNorm calculation should ignore the tokens with 
> positionIncrement=0.
> * when analyzers insert synonyms, and it's likely that users may submit 
> queries that contain multiple synonymous terms: in this case the lengthNorm 
> should be calculated as it is now, i.e. it should take into account all terms 
> no matter what is their positionIncrement.
> The default should be backward-compatible, i.e. it should count all tokens.
> (See also the discussion here: http://markmail.org/message/vfvmzrzhr6pya22h )
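
The two scenarios in the description above can be sketched as a single switch 
on the length norm.  This is only an illustration under assumptions: the class 
name OverlapAwareLengthNorm and the discountOverlaps flag are hypothetical, and 
the formula is the classic 1/sqrt(numTokens).

```java
// Hypothetical illustration of the optional behavior described in the issue.
final class OverlapAwareLengthNorm {
    // false = count every token (the backward-compatible default),
    // true  = ignore tokens that were added with positionIncrement == 0.
    private final boolean discountOverlaps;

    OverlapAwareLengthNorm(boolean discountOverlaps) {
        this.discountOverlaps = discountOverlaps;
    }

    /**
     * @param numTokens  total number of tokens in the field
     * @param numOverlap number of those tokens with positionIncrement == 0
     */
    float lengthNorm(int numTokens, int numOverlap) {
        int counted = discountOverlaps ? numTokens - numOverlap : numTokens;
        // Clamp to 1 to avoid dividing by zero for an empty field.
        return (float) (1.0 / Math.sqrt(Math.max(1, counted)));
    }
}
```

With 16 tokens of which 7 are overlaps, the default counts all 16 tokens, while 
the discounting variant counts only the 9 tokens that advance the position.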

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

