[
https://issues.apache.org/jira/browse/LUCENE-7730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Adrien Grand updated LUCENE-7730:
---------------------------------
Attachment: LUCENE-7730.patch
Here's a patch that does the following:
- adds {{LeafReader.getIndexInfos()}} which bundles the index created version
and the index sort in order to keep the number of methods on LeafReader
contained. This way similarities can decide how to decode norms based on the
created version.
- adds indexCreatedVersion to {{FieldInvertedState}} so that similarities can
decide how to encode norms based on the created version
- Given that readers now know about their created version, I improved
{{IndexWriter.addIndexes(CodecReader...)}} to fail when a reader that was
created with a different version is added.
- SimilarityBase and BM25Similarity now directly encode the length (rather
than {{1/sqrt(length)}}) in a way that preserves 4 significant bits across the
whole integer range and is exact for lengths up to 40. ClassicSimilarity is
left unmodified, however.
Here is a table of the length that each possible byte decodes to. Everything
works as if each length were rounded down to the largest value in this table
that does not exceed it.
|| Byte & 0xff || Length ||
|0|0|
|1|1|
|2|2|
|3|3|
|4|4|
|5|5|
|6|6|
|7|7|
|8|8|
|9|9|
|10|10|
|11|11|
|12|12|
|13|13|
|14|14|
|15|15|
|16|16|
|17|17|
|18|18|
|19|19|
|20|20|
|21|21|
|22|22|
|23|23|
|24|24|
|25|25|
|26|26|
|27|27|
|28|28|
|29|29|
|30|30|
|31|31|
|32|32|
|33|33|
|34|34|
|35|35|
|36|36|
|37|37|
|38|38|
|39|39|
|40|40|
|41|42|
|42|44|
|43|46|
|44|48|
|45|50|
|46|52|
|47|54|
|48|56|
|49|60|
|50|64|
|51|68|
|52|72|
|53|76|
|54|80|
|55|84|
|56|88|
|57|96|
|58|104|
|59|112|
|60|120|
|61|128|
|62|136|
|63|144|
|64|152|
|65|168|
|66|184|
|67|200|
|68|216|
|69|232|
|70|248|
|71|264|
|72|280|
|73|312|
|74|344|
|75|376|
|76|408|
|77|440|
|78|472|
|79|504|
|80|536|
|81|600|
|82|664|
|83|728|
|84|792|
|85|856|
|86|920|
|87|984|
|88|1048|
|89|1176|
|90|1304|
|91|1432|
|92|1560|
|93|1688|
|94|1816|
|95|1944|
|96|2072|
|97|2328|
|98|2584|
|99|2840|
|100|3096|
|101|3352|
|102|3608|
|103|3864|
|104|4120|
|105|4632|
|106|5144|
|107|5656|
|108|6168|
|109|6680|
|110|7192|
|111|7704|
|112|8216|
|113|9240|
|114|10264|
|115|11288|
|116|12312|
|117|13336|
|118|14360|
|119|15384|
|120|16408|
|121|18456|
|122|20504|
|123|22552|
|124|24600|
|125|26648|
|126|28696|
|127|30744|
|128|32792|
|129|36888|
|130|40984|
|131|45080|
|132|49176|
|133|53272|
|134|57368|
|135|61464|
|136|65560|
|137|73752|
|138|81944|
|139|90136|
|140|98328|
|141|106520|
|142|114712|
|143|122904|
|144|131096|
|145|147480|
|146|163864|
|147|180248|
|148|196632|
|149|213016|
|150|229400|
|151|245784|
|152|262168|
|153|294936|
|154|327704|
|155|360472|
|156|393240|
|157|426008|
|158|458776|
|159|491544|
|160|524312|
|161|589848|
|162|655384|
|163|720920|
|164|786456|
|165|851992|
|166|917528|
|167|983064|
|168|1048600|
|169|1179672|
|170|1310744|
|171|1441816|
|172|1572888|
|173|1703960|
|174|1835032|
|175|1966104|
|176|2097176|
|177|2359320|
|178|2621464|
|179|2883608|
|180|3145752|
|181|3407896|
|182|3670040|
|183|3932184|
|184|4194328|
|185|4718616|
|186|5242904|
|187|5767192|
|188|6291480|
|189|6815768|
|190|7340056|
|191|7864344|
|192|8388632|
|193|9437208|
|194|10485784|
|195|11534360|
|196|12582936|
|197|13631512|
|198|14680088|
|199|15728664|
|200|16777240|
|201|18874392|
|202|20971544|
|203|23068696|
|204|25165848|
|205|27263000|
|206|29360152|
|207|31457304|
|208|33554456|
|209|37748760|
|210|41943064|
|211|46137368|
|212|50331672|
|213|54525976|
|214|58720280|
|215|62914584|
|216|67108888|
|217|75497496|
|218|83886104|
|219|92274712|
|220|100663320|
|221|109051928|
|222|117440536|
|223|125829144|
|224|134217752|
|225|150994968|
|226|167772184|
|227|184549400|
|228|201326616|
|229|218103832|
|230|234881048|
|231|251658264|
|232|268435480|
|233|301989912|
|234|335544344|
|235|369098776|
|236|402653208|
|237|436207640|
|238|469762072|
|239|503316504|
|240|536870936|
|241|603979800|
|242|671088664|
|243|738197528|
|244|805306392|
|245|872415256|
|246|939524120|
|247|1006632984|
|248|1073741848|
|249|1207959576|
|250|1342177304|
|251|1476395032|
|252|1610612760|
|253|1744830488|
|254|1879048216|
|255|2013265944|
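For illustration, the table above is consistent with the following scheme:
small lengths are stored as-is, and larger ones are stored as a shift plus 3
mantissa bits with an implicit leading bit (4 significant bits in total). This
is a minimal sketch and not necessarily the code from the patch: the class and
method names are hypothetical, and the threshold of 24 directly stored values
is an assumption inferred from the table.

```java
public class NormLengthSketch {
  // Assumption inferred from the table: lengths below 24 are stored verbatim,
  // which leaves just enough byte values to cover the whole int range.
  static final int NUM_FREE_VALUES = 24;

  /** Encodes a non-negative value, keeping only its 4 most significant bits. */
  public static int longToInt4(long i) {
    if (i < 0) {
      throw new IllegalArgumentException("Only non-negative values are supported: " + i);
    }
    int numBits = 64 - Long.numberOfLeadingZeros(i);
    if (numBits < 4) {
      return (int) i; // small value, stored as-is with a shift code of 0
    }
    int shift = numBits - 4;
    int encoded = (int) (i >>> shift); // 4 bits, the highest one is always set
    encoded &= 0x07;                   // drop the implicit leading bit
    encoded |= (shift + 1) << 3;       // shift code 0 is reserved for small values
    return encoded;
  }

  /** Inverse of longToInt4 for values that it can represent exactly. */
  public static long int4ToLong(int i) {
    long bits = i & 0x07;
    int shift = (i >>> 3) - 1;
    return shift == -1 ? bits : (bits | 0x08) << shift;
  }

  /** Encodes a field length into a single byte, rounding down. */
  public static byte encode(int length) {
    if (length < 0) {
      throw new IllegalArgumentException("Only non-negative lengths are supported: " + length);
    }
    if (length < NUM_FREE_VALUES) {
      return (byte) length;
    }
    return (byte) (NUM_FREE_VALUES + longToInt4(length - NUM_FREE_VALUES));
  }

  /** Decodes a byte back into the length that it represents. */
  public static int decode(byte b) {
    int i = b & 0xff;
    if (i < NUM_FREE_VALUES) {
      return i;
    }
    return (int) (NUM_FREE_VALUES + int4ToLong(i - NUM_FREE_VALUES));
  }

  public static void main(String[] args) {
    // Prints one table row per possible byte value.
    for (int b = 0; b < 256; b++) {
      System.out.println("|" + b + "|" + decode((byte) b) + "|");
    }
  }
}
```

With this sketch, a length of 41 is encoded as byte 40 and decodes to 40,
while a length of 1000 is encoded as byte 87 and decodes to 984, which matches
the rounding described above.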
It is still a work in progress: for instance, some tests that rely on the way
accuracy was lost are not passing. Feedback is welcome, e.g. about better ways
to propagate the index created version or to encode the norm.
> Better encode length normalization in similarities
> --------------------------------------------------
>
> Key: LUCENE-7730
> URL: https://issues.apache.org/jira/browse/LUCENE-7730
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Adrien Grand
> Attachments: LUCENE-7730.patch
>
>
> Now that index-time boosts are gone (LUCENE-6819) and that indices record the
> version that was used to create them (for backward compatibility,
> LUCENE-7703), we can look into storing the length normalization factor more
> efficiently.