[ 
https://issues.apache.org/jira/browse/LUCENE-7730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7730:
---------------------------------
    Attachment: LUCENE-7730.patch

Here's a patch that does the following:
 - adds {{LeafReader.getIndexInfos()}} which bundles the index created version 
and the index sort in order to keep the number of methods on LeafReader 
contained. This way similarities can decide how to decode norms based on the 
created version.
 - adds indexCreatedVersion to {{FieldInvertedState}} so that similarities can 
decide how to encode norms based on the created version
 - Given that readers now know about their created version, I improved 
{{IndexWriter.addIndexes(CodecReader...)}} to fail when a reader that was 
created with a different version is added.
 - SimilarityBase and BM25Similarity now encode the length directly (rather 
than {{1/sqrt(length)}}) in a way that preserves 4 significant bits across the 
whole integer range and is exact for lengths up to 40. ClassicSimilarity is 
left unmodified, however.

Here is a table of the decoded length for every possible byte. Everything 
works as if each length were rounded down to the nearest value in this table 
that is less than or equal to it.
|| Byte & 0xff || Length ||
|0|0|
|1|1|
|2|2|
|3|3|
|4|4|
|5|5|
|6|6|
|7|7|
|8|8|
|9|9|
|10|10|
|11|11|
|12|12|
|13|13|
|14|14|
|15|15|
|16|16|
|17|17|
|18|18|
|19|19|
|20|20|
|21|21|
|22|22|
|23|23|
|24|24|
|25|25|
|26|26|
|27|27|
|28|28|
|29|29|
|30|30|
|31|31|
|32|32|
|33|33|
|34|34|
|35|35|
|36|36|
|37|37|
|38|38|
|39|39|
|40|40|
|41|42|
|42|44|
|43|46|
|44|48|
|45|50|
|46|52|
|47|54|
|48|56|
|49|60|
|50|64|
|51|68|
|52|72|
|53|76|
|54|80|
|55|84|
|56|88|
|57|96|
|58|104|
|59|112|
|60|120|
|61|128|
|62|136|
|63|144|
|64|152|
|65|168|
|66|184|
|67|200|
|68|216|
|69|232|
|70|248|
|71|264|
|72|280|
|73|312|
|74|344|
|75|376|
|76|408|
|77|440|
|78|472|
|79|504|
|80|536|
|81|600|
|82|664|
|83|728|
|84|792|
|85|856|
|86|920|
|87|984|
|88|1048|
|89|1176|
|90|1304|
|91|1432|
|92|1560|
|93|1688|
|94|1816|
|95|1944|
|96|2072|
|97|2328|
|98|2584|
|99|2840|
|100|3096|
|101|3352|
|102|3608|
|103|3864|
|104|4120|
|105|4632|
|106|5144|
|107|5656|
|108|6168|
|109|6680|
|110|7192|
|111|7704|
|112|8216|
|113|9240|
|114|10264|
|115|11288|
|116|12312|
|117|13336|
|118|14360|
|119|15384|
|120|16408|
|121|18456|
|122|20504|
|123|22552|
|124|24600|
|125|26648|
|126|28696|
|127|30744|
|128|32792|
|129|36888|
|130|40984|
|131|45080|
|132|49176|
|133|53272|
|134|57368|
|135|61464|
|136|65560|
|137|73752|
|138|81944|
|139|90136|
|140|98328|
|141|106520|
|142|114712|
|143|122904|
|144|131096|
|145|147480|
|146|163864|
|147|180248|
|148|196632|
|149|213016|
|150|229400|
|151|245784|
|152|262168|
|153|294936|
|154|327704|
|155|360472|
|156|393240|
|157|426008|
|158|458776|
|159|491544|
|160|524312|
|161|589848|
|162|655384|
|163|720920|
|164|786456|
|165|851992|
|166|917528|
|167|983064|
|168|1048600|
|169|1179672|
|170|1310744|
|171|1441816|
|172|1572888|
|173|1703960|
|174|1835032|
|175|1966104|
|176|2097176|
|177|2359320|
|178|2621464|
|179|2883608|
|180|3145752|
|181|3407896|
|182|3670040|
|183|3932184|
|184|4194328|
|185|4718616|
|186|5242904|
|187|5767192|
|188|6291480|
|189|6815768|
|190|7340056|
|191|7864344|
|192|8388632|
|193|9437208|
|194|10485784|
|195|11534360|
|196|12582936|
|197|13631512|
|198|14680088|
|199|15728664|
|200|16777240|
|201|18874392|
|202|20971544|
|203|23068696|
|204|25165848|
|205|27263000|
|206|29360152|
|207|31457304|
|208|33554456|
|209|37748760|
|210|41943064|
|211|46137368|
|212|50331672|
|213|54525976|
|214|58720280|
|215|62914584|
|216|67108888|
|217|75497496|
|218|83886104|
|219|92274712|
|220|100663320|
|221|109051928|
|222|117440536|
|223|125829144|
|224|134217752|
|225|150994968|
|226|167772184|
|227|184549400|
|228|201326616|
|229|218103832|
|230|234881048|
|231|251658264|
|232|268435480|
|233|301989912|
|234|335544344|
|235|369098776|
|236|402653208|
|237|436207640|
|238|469762072|
|239|503316504|
|240|536870936|
|241|603979800|
|242|671088664|
|243|738197528|
|244|805306392|
|245|872415256|
|246|939524120|
|247|1006632984|
|248|1073741848|
|249|1207959576|
|250|1342177304|
|251|1476395032|
|252|1610612760|
|253|1744830488|
|254|1879048216|
|255|2013265944|
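
For readers who want to experiment with this scheme, here is a minimal Java sketch that reproduces the values in the table above. It is not taken from the patch: the class name, the method names and the {{NUM_FREE_VALUES}} constant are illustrative assumptions inferred from the table. The idea is that the first 24 lengths are stored as-is and the remainder is stored as a tiny float with 3 explicit mantissa bits plus an implicit leading bit.

```java
// Hypothetical sketch, inferred from the table; not the code in LUCENE-7730.patch.
class LengthNormSketch {

    // Lengths below this value are stored verbatim (assumption derived from the table).
    static final int NUM_FREE_VALUES = 24;

    /** Encodes a non-negative length into the range 0..255. */
    static int encode(int length) {
        if (length < 0) {
            throw new IllegalArgumentException("length must be >= 0, got " + length);
        }
        if (length < NUM_FREE_VALUES) {
            return length;
        }
        return NUM_FREE_VALUES + toFloat4(length - NUM_FREE_VALUES);
    }

    /** Decodes a value in 0..255 back to the length it represents. */
    static int decode(int encoded) {
        if (encoded < NUM_FREE_VALUES) {
            return encoded;
        }
        return NUM_FREE_VALUES + fromFloat4(encoded - NUM_FREE_VALUES);
    }

    // Keeps the 4 most significant bits of v: the top bit is implicit,
    // the next 3 bits are stored, and the shift goes in the upper bits.
    private static int toFloat4(int v) {
        int numBits = 32 - Integer.numberOfLeadingZeros(v);
        if (numBits < 4) {
            return v; // "subnormal": small values are stored exactly
        }
        int shift = numBits - 4;
        // shift + 1 because a shift field of 0 is reserved for subnormal values
        return ((shift + 1) << 3) | ((v >>> shift) & 0x07);
    }

    private static int fromFloat4(int e) {
        int bits = e & 0x07;
        int shift = (e >>> 3) - 1;
        if (shift == -1) {
            return bits; // subnormal value
        }
        return (bits | 0x08) << shift; // restore the implicit leading bit
    }

    public static void main(String[] args) {
        // Print a few round trips to compare against the table.
        for (int length : new int[] {7, 40, 43, 100, 1000}) {
            int b = encode(length);
            System.out.println(length + " -> byte " + b + " -> " + decode(b));
        }
    }
}
```

With this sketch, a length of 43 encodes to byte 41 and decodes to 42, which illustrates the rounding down to the nearest table entry.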

The patch is still a work in progress: for instance, some tests that rely on 
the way accuracy was lost are not passing yet. Feedback is welcome, e.g. about 
better ways to propagate the index created version or to encode the norm.

> Better encode length normalization in similarities
> --------------------------------------------------
>
>                 Key: LUCENE-7730
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7730
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Adrien Grand
>         Attachments: LUCENE-7730.patch
>
>
> Now that index-time boosts are gone (LUCENE-6819) and that indices record the 
> version that was used to create them (for backward compatibility, 
> LUCENE-7703), we can look into storing the length normalization factor more 
> efficiently.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
