Andreas Daffner created SOLR-8893:
-------------------------------------

             Summary: Wrong TermVector docfreq calculation with enabled ExactStatsCache
                 Key: SOLR-8893
                 URL: https://issues.apache.org/jira/browse/SOLR-8893
             Project: Solr
          Issue Type: Bug
    Affects Versions: 5.5
            Reporter: Andreas Daffner


Hi,

we are currently seeing that some values calculated by the TermVector (TV)
component are clearly wrong when ExactStatsCache is enabled. The affected
value is the shard-wide docfreq calculated by the TV component.

Maybe the problem is trivial and we have simply misconfigured something.

So let's go deeper into the problem:


1) The problem in summary:
==================
We send requests with "tv.df", "tv.tf" and "tv.tf_idf" enabled:
{code}
tv.df=true&tv.tf_idf=true&tv.tf=true
{code}
Additionally, for debugging purposes, we request the following function values:
{code}
termfreq("site_term_maincontent","abakus"),docfreq("site_maincontent_term_wdf","abakus"),ttf("site_maincontent_term_wdf","abakus")
{code}
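
As a rough sketch, the request parameters above could be assembled like this in Python. The field names and the term are copied from this report; the helper name build_tvrh_params and the choice to pass the parameters as a list of pairs are only illustrative assumptions.

```python
from urllib.parse import urlencode

# Field names and term as they appear in this report.
TV_FIELD = "site_maincontent_term_wdf"
TERMFREQ_FIELD = "site_term_maincontent"
TERM = "abakus"


def build_tvrh_params(tv_field=TV_FIELD, termfreq_field=TERMFREQ_FIELD, term=TERM):
    """Return the /tvrh query parameters as a list of (name, value) pairs.

    Hypothetical helper, not part of Solr or any client library.
    """
    # Function queries requested via fl for debugging.
    debug_fl = ",".join([
        f'termfreq("{termfreq_field}","{term}")',
        f'docfreq("{tv_field}","{term}")',
        f'ttf("{tv_field}","{term}")',
    ])
    return [
        ("q", f"{tv_field}:{term}"),
        ("shards.qt", "/tvrh"),
        ("tv.fl", tv_field),
        ("tv.df", "true"),
        ("tv.tf_idf", "true"),
        ("tv.tf", "true"),
        ("fl", debug_fl),
    ]


# URL-encoded query string, ready to append to ".../tvrh?"
query_string = urlencode(build_tvrh_params())
```

The resulting string is meant to be appended to the /tvrh handler URL of the collection.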

Our findings are:
- tv.tf and termfreq seem to be correct
- tv.df and docfreq are clearly wrong
- tv.tf_idf and ttf are wrong as well, presumably as a consequence of the
wrong tv.df (docfreq)
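
To illustrate why a wrong df would carry over into tv.tf_idf, here is a minimal sketch. It assumes that tv.tf_idf is the literal quotient tf / df, as the Solr reference guide describes it; the function name and all numbers are hypothetical and not taken from our index.

```python
def tv_tf_idf(tf: int, df: int) -> float:
    """tf-idf in the assumed tv.tf_idf sense: term frequency divided by document frequency."""
    if df <= 0:
        raise ValueError("df must be positive")
    return tf / df


# Hypothetical example: the term occurs 3 times in the document.
with_correct_df = tv_tf_idf(tf=3, df=6)  # df as expected
with_wrong_df = tv_tf_idf(tf=3, df=2)    # df under-counted

# Same tf, different df: the tf_idf value changes although tf is correct.
print(with_correct_df, with_wrong_df)
```

Under this assumption, a correct tf combined with a wrong df is enough to explain a wrong tf_idf.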


2) What we have:
===========
schema.xml:
{code}
...
        <field name="site_maincontent_term_wdf" type="text_token_wdf"
               indexed="true" stored="true" termVectors="true"
               termPositions="true" termOffsets="true"/>
...
        <fieldType name="text_token_wdf" class="solr.TextField"
                   positionIncrementGap="100">
            <analyzer>
                <tokenizer class="solr.WhitespaceTokenizerFactory"/>
                <filter class="solr.LowerCaseFilterFactory"/>
                <charFilter class="solr.MappingCharFilterFactory"
                            mapping="mapping.txt"/>
            </analyzer>
        </fieldType>
...
{code}

solrconfig.xml:
{code}
...
    <statsCache class="org.apache.solr.search.stats.ExactStatsCache"/>
...
    <searchComponent name="tvComponent"
                     class="org.apache.solr.handler.component.TermVectorComponent"/>
    <requestHandler name="/tvrh"
                    class="org.apache.solr.handler.component.SearchHandler">
        <lst name="defaults">
            <bool name="tv">true</bool>
        </lst>
        <arr name="last-components">
            <str>tvComponent</str>
        </arr>
    </requestHandler>
...
{code}

You can find further details here:
http://149.202.5.192:8820/solr/#/SingleDomainSite_34_shard1_replica1


3) Examples
========

If you open this link, you can see that 6 documents contain the word "abakus"
in the field "site_maincontent_term_wdf":

http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?q=site_maincontent_term_wdf%3Aabakus+AND+site_headercode%3A200&shards.qt=%2Ftvrh&tv.fl=site_maincontent_term_wdf&tv.df=true&tv.tf_idf=true&tv.tf=true&fl=site_url_id,site_url,termfreq%28%22site_term_maincontent%22,%22abakus%22%29,docfreq%28%22site_maincontent_term_wdf%22,%22abakus%22%29,ttf%28%22site_maincontent_term_wdf%22,%22abakus%22%29

But the "docfreq" field in the returned documents is incorrect and differs
from document to document (it should always be the same):

"docfreq(field,term) returns the number of documents that contain the term in 
the field. This is a constant (the same value for all documents in the index)."
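
A small sketch of how this property could be checked on a response: collect the distinct docfreq values across the returned documents and expect exactly one. The helper name, the assumption that the response key equals the function string from fl, and the sample documents are all hypothetical; the sample values are not real output from our index.

```python
def distinct_docfreqs(docs, key):
    """Return the sorted distinct values found under `key` in the given docs."""
    return sorted({doc[key] for doc in docs if key in doc})


# Assumed response key: the function string as passed in fl.
DOCFREQ_KEY = 'docfreq("site_maincontent_term_wdf","abakus")'

# Hypothetical documents, shaped like the "docs" list of a JSON response.
sample_docs = [
    {"site_url_id": 1, DOCFREQ_KEY: 6},
    {"site_url_id": 2, DOCFREQ_KEY: 2},
    {"site_url_id": 3, DOCFREQ_KEY: 6},
]

values = distinct_docfreqs(sample_docs, DOCFREQ_KEY)
# docfreq should be constant, so more than one distinct value indicates the bug.
is_constant = len(values) == 1
```

If is_constant is False for a real response, the documents disagree about a value that should be identical for the whole index.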



Here is a link with shards.info enabled:
http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?&wt=xml&q=site_maincontent_term_wdf%3Aabakus&start=0&rows=10&fl=ttf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cdocfreq%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cidf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Csite_url&shards.qt=/tvrh&shards.info=true


Here is a link with debugQuery enabled:
http://149.202.5.192:8820/solr/SingleDomainSite_34_shard1_replica1/tvrh?omitHeader=true&shards.qt=%2Ftvrh&wt=xml&json.nl=flat&q=site_maincontent_term_wdf%3Aabakus&start=0&rows=1000&fl=ttf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cdocfreq%28site_maincontent_term_wdf%2C%27abakus%27%29%2Cidf%28site_maincontent_term_wdf%2C%27abakus%27%29%2Csite_url&debugQuery=true



