[jira] [Updated] (LUCENE-7730) Better encode length normalization in similarities

2017-05-18 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7730:
-
Attachment: LUCENE-7730.patch

Here is an updated patch. I tried moving TFIDFSimilarity and ClassicSimilarity 
to misc yesterday but gave up because MLT and some value sources depend on 
them. This new patch removes the ability to customize the norms encoding in 
TFIDFSimilarity and makes it compatible again with other similarities (so I 
added it back to Similarity randomization).

> Better encode length normalization in similarities
> --
>
> Key: LUCENE-7730
> URL: https://issues.apache.org/jira/browse/LUCENE-7730
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Adrien Grand
> Attachments: LUCENE-7730.patch, LUCENE-7730.patch, LUCENE-7730.patch, 
> LUCENE-7730.patch
>
>
> Now that index-time boosts are gone (LUCENE-6819) and that indices record the 
> version that was used to create them (for backward compatibility, 
> LUCENE-7703), we can look into storing the length normalization factor more 
> efficiently.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Updated] (LUCENE-7730) Better encode length normalization in similarities

2017-05-16 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7730:
-
Attachment: LUCENE-7730.patch

New patch. It is not really possible to switch ClassicSimilarity to the new 
encoding given how it is built on the assumption that it encodes the 
normalization factor directly while the new encoding I have been working on 
encodes the length. So I ended up doing the following:
 - ClassicSimilarity will still encode norms the same way in 7.0 as it did 
before, which means it is no longer index-time compatible with, say, 
BM25Similarity
 - ClassicSimilarity docs have been updated to advise using BM25Similarity 
instead
 - ClassicSimilarity has been moved out of similarity randomization in the test 
framework

I'd like to get it in 7.0 as this change can only be done in a major release 
(it uses the index creation major to know which encoding to use) so please 
speak up if you have concerns.







[jira] [Updated] (LUCENE-7730) Better encode length normalization in similarities

2017-04-04 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7730:
-
Attachment: LUCENE-7730.patch

Here is a new patch that builds upon LUCENE-7756. It is not 100% ready, as some 
tests still fail because I did not switch ClassicSimilarity to a new encoding, 
but it is ready for review if anyone wants to have a look.







[jira] [Updated] (LUCENE-7730) Better encode length normalization in similarities

2017-03-03 Thread Adrien Grand (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-7730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrien Grand updated LUCENE-7730:
-
Attachment: LUCENE-7730.patch

Here's a patch that does the following:
 - adds {{LeafReader.getIndexInfos()}} which bundles the index created version 
and the index sort in order to keep the number of methods on LeafReader 
contained. This way similarities can decide how to decode norms based on the 
created version.
 - adds indexCreatedVersion to {{FieldInvertedState}} so that similarities can 
decide how to encode norms based on the created version
 - Given that readers now know about their created version, I improved 
{{IndexWriter.addIndexes(CodecReader...)}} to fail when a reader that was 
created with a different version is added.
 - SimilarityBase and BM25Similarity now encode the length directly (rather 
than {{1/sqrt(length)}}) in a way that preserves 4 significant bits across the 
whole integer range and is exact for lengths up to 40. ClassicSimilarity is 
left unmodified, however.
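
To illustrate the first two bullets, here is a minimal sketch of how a similarity might pick a norm decoder from the index created version. Everything in it is hypothetical: the class name, the method name and the cutoff constant are assumptions made for illustration and are not taken from the patch or from Lucene's API. The two decoders are passed in as plain functions so that the sketch stays self-contained.

```java
import java.util.function.IntUnaryOperator;

/** Hypothetical sketch: names are illustrative, not part of Lucene or of the patch. */
final class NormDecoderChooser {

  /** Assumption: indices created by major version 7 or later use the length-based encoding. */
  static final int FIRST_MAJOR_WITH_LENGTH_ENCODING = 7;

  private NormDecoderChooser() {}

  /**
   * Returns the decoder matching the major version that created the index.
   * Both decoders map an unsigned norm byte (0..255) to an int.
   */
  static IntUnaryOperator choose(
      int indexCreatedVersionMajor,
      IntUnaryOperator legacyDecoder,
      IntUnaryOperator lengthDecoder) {
    return indexCreatedVersionMajor >= FIRST_MAJOR_WITH_LENGTH_ENCODING
        ? lengthDecoder
        : legacyDecoder;
  }
}
```

The point of the design is that the decision is made once per segment from recorded metadata, so the scoring loop never has to guess which encoding a norm byte uses.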

Here is a table of the encoded lengths for every possible byte. Everything 
works as if the lengths were rounded down to the closest value in this table.
|| Byte & 0xff || Length ||
|0|0|
|1|1|
|2|2|
|3|3|
|4|4|
|5|5|
|6|6|
|7|7|
|8|8|
|9|9|
|10|10|
|11|11|
|12|12|
|13|13|
|14|14|
|15|15|
|16|16|
|17|17|
|18|18|
|19|19|
|20|20|
|21|21|
|22|22|
|23|23|
|24|24|
|25|25|
|26|26|
|27|27|
|28|28|
|29|29|
|30|30|
|31|31|
|32|32|
|33|33|
|34|34|
|35|35|
|36|36|
|37|37|
|38|38|
|39|39|
|40|40|
|41|42|
|42|44|
|43|46|
|44|48|
|45|50|
|46|52|
|47|54|
|48|56|
|49|60|
|50|64|
|51|68|
|52|72|
|53|76|
|54|80|
|55|84|
|56|88|
|57|96|
|58|104|
|59|112|
|60|120|
|61|128|
|62|136|
|63|144|
|64|152|
|65|168|
|66|184|
|67|200|
|68|216|
|69|232|
|70|248|
|71|264|
|72|280|
|73|312|
|74|344|
|75|376|
|76|408|
|77|440|
|78|472|
|79|504|
|80|536|
|81|600|
|82|664|
|83|728|
|84|792|
|85|856|
|86|920|
|87|984|
|88|1048|
|89|1176|
|90|1304|
|91|1432|
|92|1560|
|93|1688|
|94|1816|
|95|1944|
|96|2072|
|97|2328|
|98|2584|
|99|2840|
|100|3096|
|101|3352|
|102|3608|
|103|3864|
|104|4120|
|105|4632|
|106|5144|
|107|5656|
|108|6168|
|109|6680|
|110|7192|
|111|7704|
|112|8216|
|113|9240|
|114|10264|
|115|11288|
|116|12312|
|117|13336|
|118|14360|
|119|15384|
|120|16408|
|121|18456|
|122|20504|
|123|22552|
|124|24600|
|125|26648|
|126|28696|
|127|30744|
|128|32792|
|129|36888|
|130|40984|
|131|45080|
|132|49176|
|133|53272|
|134|57368|
|135|61464|
|136|65560|
|137|73752|
|138|81944|
|139|90136|
|140|98328|
|141|106520|
|142|114712|
|143|122904|
|144|131096|
|145|147480|
|146|163864|
|147|180248|
|148|196632|
|149|213016|
|150|229400|
|151|245784|
|152|262168|
|153|294936|
|154|327704|
|155|360472|
|156|393240|
|157|426008|
|158|458776|
|159|491544|
|160|524312|
|161|589848|
|162|655384|
|163|720920|
|164|786456|
|165|851992|
|166|917528|
|167|983064|
|168|1048600|
|169|1179672|
|170|1310744|
|171|1441816|
|172|1572888|
|173|1703960|
|174|1835032|
|175|1966104|
|176|2097176|
|177|2359320|
|178|2621464|
|179|2883608|
|180|3145752|
|181|3407896|
|182|3670040|
|183|3932184|
|184|4194328|
|185|4718616|
|186|5242904|
|187|5767192|
|188|6291480|
|189|6815768|
|190|7340056|
|191|7864344|
|192|8388632|
|193|9437208|
|194|10485784|
|195|11534360|
|196|12582936|
|197|13631512|
|198|14680088|
|199|15728664|
|200|16777240|
|201|18874392|
|202|20971544|
|203|23068696|
|204|25165848|
|205|27263000|
|206|29360152|
|207|31457304|
|208|33554456|
|209|37748760|
|210|41943064|
|211|46137368|
|212|50331672|
|213|54525976|
|214|58720280|
|215|62914584|
|216|67108888|
|217|75497496|
|218|83886104|
|219|92274712|
|220|100663320|
|221|109051928|
|222|117440536|
|223|125829144|
|224|134217752|
|225|150994968|
|226|167772184|
|227|184549400|
|228|201326616|
|229|218103832|
|230|234881048|
|231|251658264|
|232|268435480|
|233|301989912|
|234|335544344|
|235|369098776|
|236|402653208|
|237|436207640|
|238|469762072|
|239|503316504|
|240|536870936|
|241|603979800|
|242|671088664|
|243|738197528|
|244|805306392|
|245|872415256|
|246|939524120|
|247|1006632984|
|248|1073741848|
|249|1207959576|
|250|1342177304|
|251|1476395032|
|252|1610612760|
|253|1744830488|
|254|1879048216|
|255|2013265944|
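
For illustration, the table above appears consistent with the following scheme: the first 24 byte values store the length exactly, and larger lengths are offset by 24 and stored in a float-like form with a 3-bit mantissa, an implicit leading bit and a 5-bit shift. The sketch below is written under that assumption and is not the code from the attached patch; the class name {{LengthNormSketch}} and its method names are hypothetical.

```java
/** Hypothetical sketch of a 1-byte length encoding that keeps 4 significant bits. */
final class LengthNormSketch {

  private LengthNormSketch() {}

  /** Encodes a non-negative value as a 3-bit mantissa plus a 5-bit (shift + 1). */
  private static int longToInt4(long value) {
    if (value < 0) {
      throw new IllegalArgumentException("negative value: " + value);
    }
    int numBits = 64 - Long.numberOfLeadingZeros(value);
    if (numBits < 4) {
      return (int) value; // small values (0..7) are stored as-is
    }
    int shift = numBits - 4;
    int encoded = (int) (value >>> shift); // keep the 4 most significant bits
    encoded &= 0x07;                       // the leading 1 bit is implicit
    encoded |= (shift + 1) << 3;           // shift + 1, since 0 marks small values
    return encoded;
  }

  /** Inverse of longToInt4, returning the lower bound of the encoded bucket. */
  private static long int4ToLong(int encoded) {
    long bits = encoded & 0x07;
    int shift = (encoded >>> 3) - 1;
    return shift == -1 ? bits : (bits | 0x08) << shift;
  }

  // Codes not needed to cover the int range are spent on exact small lengths.
  private static final int MAX_INT4 = longToInt4(Integer.MAX_VALUE);
  private static final int NUM_FREE_VALUES = 255 - MAX_INT4;

  /** Encodes a field length on one byte, rounding down to a representable value. */
  static byte lengthToByte(int length) {
    if (length < 0) {
      throw new IllegalArgumentException("negative length: " + length);
    }
    if (length < NUM_FREE_VALUES) {
      return (byte) length;
    }
    return (byte) (NUM_FREE_VALUES + longToInt4(length - NUM_FREE_VALUES));
  }

  /** Decodes a byte produced by lengthToByte. */
  static int byteToLength(byte b) {
    int i = b & 0xFF;
    if (i < NUM_FREE_VALUES) {
      return i;
    }
    return (int) (NUM_FREE_VALUES + int4ToLong(i - NUM_FREE_VALUES));
  }

  public static void main(String[] args) {
    // Print a few rows in the same "byte, length" layout as the table.
    for (int b : new int[] {0, 40, 41, 48, 49, 255}) {
      System.out.println(b + " " + byteToLength((byte) b));
    }
  }
}
```

Encoding then decoding a length gives the largest table value that does not exceed it, which is the rounding behaviour described before the table.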

It is still a work in progress: for instance, some tests that rely on the way 
accuracy was lost are not passing. Feedback is welcome, e.g. about better ways 
to propagate the index created version or to encode the norm.

> Better encode length normalization in similarities
> --
>
> Key: LUCENE-7730
> URL: https://issues.apache.org/jira/browse/LUCENE-7730
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Adrien Grand
> Attachments: LUCENE-7730.patch
>
>
> Now that index-time boosts are gone (LUCENE-6819) and that indices record the 
> version that was used to