[jira] [Updated] (LUCENE-7730) Better encode length normalization in similarities
[ https://issues.apache.org/jira/browse/LUCENE-7730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-7730:
---------------------------------
    Attachment: LUCENE-7730.patch

Here is an updated patch. I tried moving TFIDFSimilarity and ClassicSimilarity to misc yesterday but gave up due to how MLT and some values sources depend on it. This new patch removes the ability to customize the norms encoding in TFIDFSimilarity and makes it compatible again with other similarities (so I added it back to Similarity randomization).

> Better encode length normalization in similarities
> --------------------------------------------------
>
>                 Key: LUCENE-7730
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7730
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>         Attachments: LUCENE-7730.patch, LUCENE-7730.patch, LUCENE-7730.patch,
> LUCENE-7730.patch
>
>
> Now that index-time boosts are gone (LUCENE-6819) and that indices record the
> version that was used to create them (for backward compatibility,
> LUCENE-7703), we can look into storing the length normalization factor more
> efficiently.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Updated] (LUCENE-7730) Better encode length normalization in similarities
[ https://issues.apache.org/jira/browse/LUCENE-7730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-7730:
---------------------------------
    Attachment: LUCENE-7730.patch

New patch. It is not really possible to switch ClassicSimilarity to the new encoding: it is built on the assumption that it encodes the normalization factor directly, while the new encoding I have been working on encodes the length. So I ended up doing the following:
 - ClassicSimilarity will still encode norms the same way in 7.0 as it did before, which means it is no longer index-time compatible with, say, BM25Similarity
 - ClassicSimilarity docs have been updated to advise using BM25Similarity instead
 - ClassicSimilarity has been moved out of similarity randomization in the test framework

I'd like to get it in 7.0 as this change can only be done in a major release (it uses the index creation major version to know which encoding to use), so please speak up if you have concerns.

> Better encode length normalization in similarities
> --------------------------------------------------
>
>                 Key: LUCENE-7730
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7730
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>         Attachments: LUCENE-7730.patch, LUCENE-7730.patch, LUCENE-7730.patch
>
>
> Now that index-time boosts are gone (LUCENE-6819) and that indices record the
> version that was used to create them (for backward compatibility,
> LUCENE-7703), we can look into storing the length normalization factor more
> efficiently.
[jira] [Updated] (LUCENE-7730) Better encode length normalization in similarities
[ https://issues.apache.org/jira/browse/LUCENE-7730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-7730:
---------------------------------
    Attachment: LUCENE-7730.patch

Here is a new patch that builds upon LUCENE-7756. It is not 100% ready: some tests still fail because I did not switch ClassicSimilarity to a new encoding. It is ready for review, though, if anyone wants to have a look.

> Better encode length normalization in similarities
> --------------------------------------------------
>
>                 Key: LUCENE-7730
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7730
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>         Attachments: LUCENE-7730.patch, LUCENE-7730.patch
>
>
> Now that index-time boosts are gone (LUCENE-6819) and that indices record the
> version that was used to create them (for backward compatibility,
> LUCENE-7703), we can look into storing the length normalization factor more
> efficiently.
[jira] [Updated] (LUCENE-7730) Better encode length normalization in similarities
[ https://issues.apache.org/jira/browse/LUCENE-7730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adrien Grand updated LUCENE-7730:
---------------------------------
    Attachment: LUCENE-7730.patch

Here's a patch that does the following:
 - adds {{LeafReader.getIndexInfos()}}, which bundles the index created version and the index sort in order to keep the number of methods on LeafReader contained. This way similarities can decide how to decode norms based on the created version.
 - adds indexCreatedVersion to {{FieldInvertedState}} so that similarities can decide how to encode norms based on the created version
 - Given that readers now know about their created version, I improved {{IndexWriter.addIndexes(CodecReader...)}} to fail when a reader that was created with a different version is added.
 - SimilarityBase and BM25Similarity now encode the length directly (rather than {{1/sqrt(length)}}) in a way that preserves 4 significant bits across the whole integer range and is exact for lengths up to 40. ClassicSimilarity is left unmodified, however.

Here is a table of the encoded lengths for every possible byte. Everything works as if each length were rounded down to the largest value in this table that is not greater than it.
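For readers who want to reproduce the table that follows, here is a minimal sketch of one encoding that fits the description above: small lengths are stored as-is, and larger ones keep only their 4 most significant bits plus a shift. The class name {{LengthCodec}} and its method names are hypothetical and are not taken from the patch. That exactly 24 byte values are left over for exact small lengths is something this sketch derives itself, not a figure stated in the message.

```java
// Hypothetical helper, NOT taken from the patch: a 4-significant-bit length codec.
final class LengthCodec {
    // Code that the float-like scheme below assigns to Integer.MAX_VALUE.
    private static final int MAX_INT4 = longToInt4(Integer.MAX_VALUE);
    // Byte values left unused by that scheme; they store the smallest lengths exactly.
    private static final int NUM_FREE_VALUES = 255 - MAX_INT4;

    private LengthCodec() {}

    // Float-like code: a 3-bit mantissa (with an implicit leading 1) plus a shift.
    private static int longToInt4(long i) {
        int numBits = 64 - Long.numberOfLeadingZeros(i);
        if (numBits < 4) {
            return (int) i; // fewer than 4 bits: stored exactly
        }
        int shift = numBits - 4;
        int encoded = (int) (i >>> shift); // keep the 4 most significant bits
        encoded &= 0x07;                   // the top bit is implicit
        encoded |= (shift + 1) << 3;       // shift + 1, since 0 marks exact values
        return encoded;
    }

    private static long int4ToLong(int i) {
        long bits = i & 0x07;
        int shift = (i >>> 3) - 1;
        return shift == -1 ? bits : (bits | 0x08) << shift;
    }

    /** Encodes a non-negative length into one byte, rounding down. */
    static byte encode(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("length must be >= 0, got " + length);
        }
        if (length < NUM_FREE_VALUES) {
            return (byte) length;
        }
        return (byte) (NUM_FREE_VALUES + longToInt4(length - NUM_FREE_VALUES));
    }

    /** Decodes a byte back to the length it stands for. */
    static int decode(byte b) {
        int i = b & 0xff;
        if (i < NUM_FREE_VALUES) {
            return i;
        }
        return (int) (NUM_FREE_VALUES + int4ToLong(i - NUM_FREE_VALUES));
    }

    // Prints one table row per byte value, in the same wiki markup as the table.
    public static void main(String[] args) {
        System.out.println("|| Byte & 0xff || Length ||");
        for (int b = 0; b < 256; b++) {
            System.out.println("|" + b + "|" + decode((byte) b) + "|");
        }
    }
}
```

With this sketch, {{LengthCodec.decode(LengthCodec.encode(1000))}} yields 984, the largest table value that does not exceed 1000.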
|| Byte & 0xff || Length ||
|0|0|
|1|1|
|2|2|
|3|3|
|4|4|
|5|5|
|6|6|
|7|7|
|8|8|
|9|9|
|10|10|
|11|11|
|12|12|
|13|13|
|14|14|
|15|15|
|16|16|
|17|17|
|18|18|
|19|19|
|20|20|
|21|21|
|22|22|
|23|23|
|24|24|
|25|25|
|26|26|
|27|27|
|28|28|
|29|29|
|30|30|
|31|31|
|32|32|
|33|33|
|34|34|
|35|35|
|36|36|
|37|37|
|38|38|
|39|39|
|40|40|
|41|42|
|42|44|
|43|46|
|44|48|
|45|50|
|46|52|
|47|54|
|48|56|
|49|60|
|50|64|
|51|68|
|52|72|
|53|76|
|54|80|
|55|84|
|56|88|
|57|96|
|58|104|
|59|112|
|60|120|
|61|128|
|62|136|
|63|144|
|64|152|
|65|168|
|66|184|
|67|200|
|68|216|
|69|232|
|70|248|
|71|264|
|72|280|
|73|312|
|74|344|
|75|376|
|76|408|
|77|440|
|78|472|
|79|504|
|80|536|
|81|600|
|82|664|
|83|728|
|84|792|
|85|856|
|86|920|
|87|984|
|88|1048|
|89|1176|
|90|1304|
|91|1432|
|92|1560|
|93|1688|
|94|1816|
|95|1944|
|96|2072|
|97|2328|
|98|2584|
|99|2840|
|100|3096|
|101|3352|
|102|3608|
|103|3864|
|104|4120|
|105|4632|
|106|5144|
|107|5656|
|108|6168|
|109|6680|
|110|7192|
|111|7704|
|112|8216|
|113|9240|
|114|10264|
|115|11288|
|116|12312|
|117|13336|
|118|14360|
|119|15384|
|120|16408|
|121|18456|
|122|20504|
|123|22552|
|124|24600|
|125|26648|
|126|28696|
|127|30744|
|128|32792|
|129|36888|
|130|40984|
|131|45080|
|132|49176|
|133|53272|
|134|57368|
|135|61464|
|136|65560|
|137|73752|
|138|81944|
|139|90136|
|140|98328|
|141|106520|
|142|114712|
|143|122904|
|144|131096|
|145|147480|
|146|163864|
|147|180248|
|148|196632|
|149|213016|
|150|229400|
|151|245784|
|152|262168|
|153|294936|
|154|327704|
|155|360472|
|156|393240|
|157|426008|
|158|458776|
|159|491544|
|160|524312|
|161|589848|
|162|655384|
|163|720920|
|164|786456|
|165|851992|
|166|917528|
|167|983064|
|168|1048600|
|169|1179672|
|170|1310744|
|171|1441816|
|172|1572888|
|173|1703960|
|174|1835032|
|175|1966104|
|176|2097176|
|177|2359320|
|178|2621464|
|179|2883608|
|180|3145752|
|181|3407896|
|182|3670040|
|183|3932184|
|184|4194328|
|185|4718616|
|186|5242904|
|187|5767192|
|188|6291480|
|189|6815768|
|190|7340056|
|191|7864344|
|192|8388632|
|193|9437208|
|194|10485784|
|195|11534360|
|196|12582936|
|197|13631512|
|198|14680088|
|199|15728664|
|200|16777240|
|201|18874392|
|202|20971544|
|203|23068696|
|204|25165848|
|205|27263000|
|206|29360152|
|207|31457304|
|208|33554456|
|209|37748760|
|210|41943064|
|211|46137368|
|212|50331672|
|213|54525976|
|214|58720280|
|215|62914584|
|216|67108888|
|217|75497496|
|218|83886104|
|219|92274712|
|220|100663320|
|221|109051928|
|222|117440536|
|223|125829144|
|224|134217752|
|225|150994968|
|226|167772184|
|227|184549400|
|228|201326616|
|229|218103832|
|230|234881048|
|231|251658264|
|232|268435480|
|233|301989912|
|234|335544344|
|235|369098776|
|236|402653208|
|237|436207640|
|238|469762072|
|239|503316504|
|240|536870936|
|241|603979800|
|242|671088664|
|243|738197528|
|244|805306392|
|245|872415256|
|246|939524120|
|247|1006632984|
|248|1073741848|
|249|1207959576|
|250|1342177304|
|251|1476395032|
|252|1610612760|
|253|1744830488|
|254|1879048216|
|255|2013265944|

It is still a work in progress; for instance, some tests that rely on the way accuracy was lost are not passing. Feedback is welcome, e.g. about better ways that we could propagate the index created version or encode the norm.

> Better encode length normalization in similarities
> --------------------------------------------------
>
>                 Key: LUCENE-7730
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7730
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>         Attachments: LUCENE-7730.patch
>
>
> Now that index-time boosts are gone (LUCENE-6819) and that indices record the
> version that was used to