Andreas Daffner created SOLR-8893:
-------------------------------------
Summary: Wrong TermVector docfreq calculation with enabled
ExactStatsCache
Key: SOLR-8893
URL: https://issues.apache.org/jira/browse/SOLR-8893
Project: Solr
Issue Type: Bug
Affects Versions: 5.5
Reporter: Andreas Daffner
Hi,
we are currently seeing that some values calculated by the TermVector (TV)
component are wrong when ExactStatsCache is enabled, specifically the
shard-wide TV docfreq calculation.
Maybe the problem is trivial and we have simply misconfigured something ...
So let's go deeper into the problem:
1) The problem in summary:
==================
We send our requests with "tv.df", "tv.tf" and "tv.tf_idf" enabled:
{code}
tv.df=true&tv.tf_idf=true&tv.tf=true
{code}
Additionally, for debugging purposes, we request the following functions:
{code}
termfreq("site_term_maincontent","abakus"),docfreq("site_maincontent_term_wdf","abakus"),ttf("site_maincontent_term_wdf","abakus")
{code}
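For anyone who wants to reproduce the request, here is a minimal Python sketch that assembles the query string from the parameters above. The helper name `build_tvrh_query` is hypothetical, and for simplicity it passes the same field to all three functions (the request above passes "site_term_maincontent" to termfreq); adjust it to your own field names.

```python
from urllib.parse import urlencode


def build_tvrh_query(field: str, term: str) -> str:
    """Build the query string for a /tvrh request (hypothetical helper)."""
    # Debug functions returned as pseudo-fields via the fl parameter.
    functions = [
        f'termfreq("{field}","{term}")',
        f'docfreq("{field}","{term}")',
        f'ttf("{field}","{term}")',
    ]
    params = {
        "q": f"{field}:{term}",
        "shards.qt": "/tvrh",
        "tv.fl": field,
        "tv.df": "true",
        "tv.tf": "true",
        "tv.tf_idf": "true",
        "fl": ",".join(["site_url"] + functions),
    }
    # urlencode percent-encodes quotes, parentheses, commas and slashes.
    return urlencode(params)


query = build_tvrh_query("site_maincontent_term_wdf", "abakus")
```

The resulting string can be appended to the /tvrh handler URL after a "?".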
Our findings are:
- tv.tf and termfreq seem to be correct
- tv.df and docfreq are obviously wrong
- tv.tf_idf and ttf are wrong as well, I guess as a consequence of the wrong
tv.df (docfreq)
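To illustrate how a wrong tv.df would carry over into tv.tf_idf, here is a small sketch. It assumes tv.tf_idf is simply tf divided by df, as the Solr Reference Guide describes that parameter; the function name and the numbers are hypothetical and not taken from our index.

```python
def tv_tf_idf(tf: int, df: int) -> float:
    """Return tf / df, the assumed definition of tv.tf_idf."""
    if df <= 0:
        raise ValueError("df must be positive for a term that occurs in the index")
    return tf / df


# Hypothetical numbers: the same tf with a correct and an incorrect df.
with_correct_df = tv_tf_idf(tf=3, df=6)
with_wrong_df = tv_tf_idf(tf=3, df=2)
```

With the same tf, any error in df changes the quotient directly, which would match the third finding for tv.tf_idf.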
2) What we have:
===========
schema.xml:
{code}
...
<field name="site_maincontent_term_wdf" type="text_token_wdf"
indexed="true" stored="true" termVectors="true"
termPositions="true" termOffsets="true"/>
...
<fieldType name="text_token_wdf" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory"
mapping="mapping.txt"/>
</analyzer>
</fieldType>
...
{code}
solrconfig.xml:
{code}
...
<statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
...
<searchComponent name="tvComponent"
class="org.apache.solr.handler.component.TermVectorComponent"/>
<requestHandler name="/tvrh"
class="org.apache.solr.handler.component.SearchHandler">
<lst name="defaults">
<bool name="tv">true</bool>
</lst>
<arr name="last-components">
<str>tvComponent</str>
</arr>
</requestHandler>
...
{code}
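For context, this is what we expect from the statsCache setting: the per-shard document frequencies of a term are merged into one collection-wide value. The following is only a conceptual sketch of that expectation, not Solr's implementation; the function name and the shard numbers are hypothetical.

```python
from collections import Counter


def merge_shard_docfreqs(per_shard: list[dict[str, int]]) -> dict[str, int]:
    """Sum per-shard document frequencies into global ones (conceptual only)."""
    total: Counter = Counter()
    for shard_stats in per_shard:
        # Shards hold disjoint documents, so their frequencies simply add up.
        total.update(shard_stats)
    return dict(total)


# Hypothetical per-shard statistics for illustration.
global_df = merge_shard_docfreqs([{"abakus": 4}, {"abakus": 2, "solr": 1}])
```

Every document in a result set should then report the same merged docfreq for a given term.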
You can find all the details here:
http://149.202.5.192:8820/solr/#/SingleDomainSite_34_shard1_replica1
3) Examples
========
If you open this link, you can see that there are 6 documents containing the
word "abakus" in the field "site_maincontent_term_wdf" ...
http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?q=site_maincontent_term_wdf%3Aabakus+AND+site_headercode%3A200&shards.qt=%2Ftvrh&tv.fl=site_maincontent_term_wdf&tv.df=true&tv.tf_idf=true&tv.tf=true&fl=site_url_id,site_url,termfreq%28%22site_term_maincontent%22,%22abakus%22%29,docfreq%28%22site_maincontent_term_wdf%22,%22abakus%22%29,ttf%28%22site_maincontent_term_wdf%22,%22abakus%22%29
But if you look at the field "docfreq" in the returned documents, it is
incorrect and differs from document to document (it should always be the same ...).
"docfreq(field,term) returns the number of documents that contain the term in
the field. This is a constant (the same value for all documents in the index)."
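A quick way to check this property on a response is sketched below. It assumes the documents have been parsed from a wt=json response into a list of dicts and that the function result is keyed by the function text as written in fl; both are assumptions, and the sample documents are hypothetical.

```python
def distinct_values(docs: list[dict], key: str) -> list:
    """Return the sorted distinct values found under key across docs."""
    return sorted({doc[key] for doc in docs if key in doc})


# Assumed pseudo-field name for the docfreq function result.
DOCFREQ_KEY = 'docfreq("site_maincontent_term_wdf","abakus")'

# Hypothetical documents, not real output from our index.
sample_docs = [
    {"site_url": "http://example.org/a", DOCFREQ_KEY: 6},
    {"site_url": "http://example.org/b", DOCFREQ_KEY: 4},
    {"site_url": "http://example.org/c", DOCFREQ_KEY: 6},
]

values = distinct_values(sample_docs, DOCFREQ_KEY)
is_constant = len(values) == 1
```

If docfreq behaves as documented, `values` contains exactly one entry; more than one entry reproduces the problem described here.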
Here is a link with shards.info enabled:
http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?&wt=xml&q=site_maincontent_term_wdf%3Aabakus&start=0&rows=10&fl=ttf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cdocfreq%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cidf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Csite_url&shards.qt=/tvrh&shards.info=true
Here is a link with debug enabled:
http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?omitHeader=true&shards.qt=%2Ftvrh&wt=xml&json.nl=flat&q=site_maincontent_term_wdf%3Aabakus&start=0&rows=1000&fl=ttf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cdocfreq%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cidf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Csite_url&debugQuery=true