[ https://issues.apache.org/jira/browse/LUCENE-1420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12639668#action_12639668 ]
Hoss Man commented on LUCENE-1420:
----------------------------------

1) i only skimmed this quickly, but i don't think the changes to SweetSpotSimilarity are back compatible ... setLengthNormFactors has a new arg list.

2) ditto for the public "Info" constructor in MemoryIndex.java

3) as long as we are adding a new lengthNorm method that has access to new data about the stream, would it also make sense to pass in fieldState.position? and/or a new count of the number of times getPositionIncrementGap(fieldInfo.name) is called? Those also seem like they could be useful, and should be just as cheap to keep track of as numOverlap and length. (this occurred to me because of recent threads on solr-user asking about lengthNorm and multivalued fields ... there may only be one fieldNorm per field name, but with stats like that we could at least do some interesting things based on the average length of each field value.)

4) independent of #3, we may want to consider making FieldInvertState a public class and passing it directly to lengthNorm ... that way lengthNorm can utilize whatever data it wants, and we can add more available data later without changing the API again. We could even deprecate lengthNorm entirely and add a new FieldInvertState.norm property that a new Similarity.computeNorm(FieldInvertState) could set directly, so it could choose to ignore the doc & field boosts altogether if it wanted to.

> Similarity.lengthNorm and positionIncrement=0
> ---------------------------------------------
>
>                 Key: LUCENE-1420
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1420
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Index
>    Affects Versions: 2.3.3, 2.9
>            Reporter: Andrzej Bialecki
>            Assignee: Michael McCandless
>             Fix For: 2.3.3, 2.9
>
>         Attachments: similarity.patch
>
>
> Calculation of lengthNorm factor should in some cases take into account the
> number of tokens with positionIncrement=0.
> This should be made optional, to
> support two different scenarios:
> * when analyzers insert artificially constructed tokens into TokenStream
> (e.g. ASCII-fied versions of accented terms, stemmed terms), and it's
> unlikely that users submit queries containing both versions of tokens: in
> this case lengthNorm calculation should ignore the tokens with
> positionIncrement=0.
> * when analyzers insert synonyms, and it's likely that users may submit
> queries that contain multiple synonymous terms: in this case the lengthNorm
> should be calculated as it is now, i.e. it should take into account all terms
> no matter what their positionIncrement is.
> The default should be backward-compatible, i.e. it should count all tokens.
> (See also the discussion here: http://markmail.org/message/vfvmzrzhr6pya22h )
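
To make point 4 and the two scenarios in the issue description a bit more concrete, here is a rough, self-contained sketch of what a norm computed from a single state object might look like. Everything in it is an assumption for illustration: FieldState is a hypothetical stand-in (not Lucene's FieldInvertState), computeNorm and the discountOverlaps flag are invented names, and the 1/sqrt(numTokens) formula and the empty-field guard are simply choices made for this sketch, not what the attached patch does.

```java
public class LengthNormSketch {

    /** Hypothetical per-field state; a stand-in for illustration, not Lucene's class. */
    static final class FieldState {
        final int length;      // total tokens seen for the field, overlaps included
        final int numOverlap;  // tokens that arrived with positionIncrement == 0
        final int position;    // last token position (carried along, unused below)
        final float boost;     // combined doc and field boost

        FieldState(int length, int numOverlap, int position, float boost) {
            this.length = length;
            this.numOverlap = numOverlap;
            this.position = position;
            this.boost = boost;
        }
    }

    /**
     * Computes a norm from the whole state object, so new statistics can be
     * added to FieldState later without changing this signature.
     * When discountOverlaps is true, tokens with positionIncrement == 0 are
     * left out of the token count; when false, every token is counted.
     */
    static float computeNorm(FieldState state, boolean discountOverlaps) {
        int numTokens = discountOverlaps
                ? state.length - state.numOverlap
                : state.length;
        if (numTokens <= 0) {
            // Assumption of this sketch: an empty field just keeps its boost.
            return state.boost;
        }
        return state.boost * (float) (1.0 / Math.sqrt(numTokens));
    }

    public static void main(String[] args) {
        // 10 tokens in total, 4 of them stacked on an existing position.
        FieldState state = new FieldState(10, 4, 5, 1.0f);
        System.out.println("count all tokens:  " + computeNorm(state, false));
        System.out.println("discount overlaps: " + computeNorm(state, true));
    }
}
```

With discountOverlaps=false the sketch counts every token, which corresponds to the backward-compatible default asked for in the issue; with true it covers the "artificially constructed tokens" scenario. The position field is only there to show that extra statistics from point 3 could ride along in the same object.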