[
https://issues.apache.org/jira/browse/LUCENE-8031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-8031:
--------------------------------
Attachment: LUCENE-8031.patch
Here is a patch. I haven't improved the tests yet, and it doesn't address
downgrading at all.
I ran omitTF experiments measuring mean average precision on 3 test
collections in different languages, with and without stopwords, and with
different scoring systems.
english:
EnglishAnalyzer(CharArraySet.EMPTY_SET)
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.3363|0.1465|0.2080|+42.0%|
|BM25|0.4492|0.2023|0.2746|+35.7%|
|I(ne)B2|0.4553|0.2151|0.2801|+30.2%|
|I(ne)B1|0.4231|0.1679|0.2539|+51.2%|
|PL2|0.3624|0.2006|0.2656|+32.4%|
|LM(dirichlet)|0.4408|0.2814|0.2851|+1.3%|
|DFI(chisquare)|0.4236|0.2493|0.2819|+13.1%|
EnglishAnalyzer()
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.3478|0.1651|0.2052|+24.3%|
|BM25|0.4505|0.2269|0.2720|+19.9%|
|I(ne)B2|0.4563|0.2401|0.2785|+16.0%|
|I(ne)B1|0.4285|0.1992|0.2516|+26.3%|
|PL2|0.4438|0.2182|0.2617|+19.9%|
|LM(dirichlet)|0.4372|0.2827|0.2851|+0.8%|
|DFI(chisquare)|0.4380|0.2637|0.2858|+8.4%|
bengali:
BengaliAnalyzer(CharArraySet.EMPTY_SET)
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.2326|0.1211|0.1371|+13.2%|
|BM25|0.2989|0.1367|0.1673|+22.4%|
|I(ne)B2|0.3111|0.1469|0.1738|+18.3%|
|I(ne)B1|0.2886|0.1237|0.1520|+22.9%|
|PL2|0.2906|0.1372|0.1636|+19.2%|
|LM(dirichlet)|0.3007|0.1805|0.1829|+1.3%|
|DFI(chisquare)|0.2938|0.1678|0.1790|+6.7%|
BengaliAnalyzer()
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.2266|0.1231|0.1360|+10.5%|
|BM25|0.2947|0.1390|0.1649|+18.6%|
|I(ne)B2|0.3074|0.1485|0.1723|+16.0%|
|I(ne)B1|0.2848|0.1248|0.1486|+19.1%|
|PL2|0.2856|0.1377|0.1608|+16.8%|
|LM(dirichlet)|0.2982|0.1803|0.1836|+1.8%|
|DFI(chisquare)|0.2887|0.1703|0.1810|+6.3%|
kurdish:
SoraniAnalyzer(CharArraySet.EMPTY_SET)
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.2957|0.1625|0.1811|+11.4%|
|BM25|0.3207|0.1871|0.2087|+11.5%|
|I(ne)B2|0.3354|0.1937|0.2113|+9.1%|
|I(ne)B1|0.3263|0.1762|0.1992|+13.1%|
|PL2|0.3134|0.1738|0.2002|+15.2%|
|LM(dirichlet)|0.2877|0.2130|0.2149|+0.9%|
|DFI(chisquare)|0.3157|0.2014|0.2129|+5.7%|
SoraniAnalyzer()
||Sim||DOCS_AND_FREQS||DOCS (master)||DOCS (patch)||diff||
|Classic|0.2977|0.1654|0.1781|+7.7%|
|BM25|0.3205|0.1918|0.2077|+8.3%|
|I(ne)B2|0.3345|0.1979|0.2107|+6.5%|
|I(ne)B1|0.3266|0.1798|0.1970|+9.6%|
|PL2|0.3115|0.1761|0.1998|+13.5%|
|LM(dirichlet)|0.2815|0.2116|0.2144|+1.3%|
|DFI(chisquare)|0.3143|0.2022|0.2115|+4.6%|
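For readers checking the numbers: the diff column appears to be the relative change of DOCS (patch) over DOCS (master), which is consistent with every row above. A minimal sketch of that arithmetic follows; the class and method names are hypothetical and not part of the patch.

```java
import java.util.Locale;

// Hypothetical helper, not part of LUCENE-8031.patch: reproduces the "diff" column.
final class MapDiff {
    /** Relative change of the patched MAP over the master MAP, formatted like "+42.0%". */
    static String relativeChange(double master, double patch) {
        double percent = 100.0 * (patch - master) / master;
        // Locale.ROOT keeps the decimal separator a '.' regardless of the JVM default locale.
        return String.format(Locale.ROOT, "%+.1f%%", percent);
    }

    public static void main(String[] args) {
        // Classic row of the EnglishAnalyzer(CharArraySet.EMPTY_SET) table.
        System.out.println(relativeChange(0.1465, 0.2080)); // prints "+42.0%"
    }
}
```

Note that the diff compares only the two DOCS columns; DOCS_AND_FREQS is shown as the reference point for what is lost by omitting frequencies.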
> DOCS_ONLY fields set incorrect length norms
> -------------------------------------------
>
> Key: LUCENE-8031
> URL: https://issues.apache.org/jira/browse/LUCENE-8031
> Project: Lucene - Core
> Issue Type: Bug
> Reporter: Robert Muir
> Attachments: LUCENE-8031.patch
>
>
> Term frequencies are discarded in the DOCS_ONLY case from the postings list
> but they still count against the length normalization, which looks like it
> may screw stuff up.
> I ran some quick experiments on LUCENE-8025, by encoding
> fieldInvertState.getUniqueTermCount() and it seemed worth fixing (e.g. 20% or
> 30% improvement potentially). Happy to do testing for real, if we want to fix.
> But this seems tricky: today you can downgrade to DOCS_ONLY on the fly, and
> it's hard for me to think about that case (I think it's generally screwed up
> besides this, but still).
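The mismatch described in the issue can be illustrated without Lucene at all. The sketch below is plain Java under stated assumptions: the class and method names are hypothetical, the 1/sqrt(length) formula is used only as a familiar example of length normalization, and none of this is the patch's actual code. It contrasts a length that counts every token with a length that counts each term once, which is the quantity fieldInvertState.getUniqueTermCount() exposes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration, not taken from LUCENE-8031.patch.
final class DocsOnlyLength {
    /** Length counting every token occurrence, repeats included. */
    static int totalLength(List<String> tokens) {
        return tokens.size();
    }

    /** Length counting each distinct term once, matching postings that store no frequencies. */
    static int uniqueTermCount(List<String> tokens) {
        return new HashSet<>(tokens).size();
    }

    /** Example length normalization 1/sqrt(length), chosen for illustration only. */
    static double lengthNorm(int length) {
        return 1.0 / Math.sqrt(length);
    }

    public static void main(String[] args) {
        // "a" occurs three times, but a DOCS_ONLY postings list records it only once.
        List<String> tokens = Arrays.asList("a", "a", "a", "b");
        System.out.println(lengthNorm(totalLength(tokens)));     // norm penalized by repeats
        System.out.println(lengthNorm(uniqueTermCount(tokens))); // norm based on distinct terms
    }
}
```

With frequencies discarded, the repeated occurrences of "a" can no longer raise the score, yet under the first length they still lower the norm, which is the asymmetry the issue title refers to.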