[ 
https://issues.apache.org/jira/browse/SOLR-4260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558910#comment-13558910
 ] 

Markus Jelsma commented on SOLR-4260:
-------------------------------------

OK, this is not a SolrCloud issue; I can also reproduce it in stand-alone and 
multi-core setups. It is also not specific to BM25, since TFIDF has the same 
problem. Nor does docCount vs. maxCount seem to be the problem.

I now have two identical cores set up and indexed the same data to both: no 
problem, everything is consistent. Then I reindex the same data again to only 
one of the two cores, and the trouble starts. There is a small variation in 
maxDoc, which is expected, but there is also a variation in docFreq, which is 
very unexpected: docFreq must not change at all if I reindex the same data.

Here's a debug snippet from the first core, which did not receive reindexed data:

{code}
910.47974 = (MATCH) sum of:
  910.47974 = (MATCH) max plus 0.35 times others of:
    793.99835 = (MATCH) weight(title_en:groningen^6.4 in 5132) [], result of:
      793.99835 = score(doc=5132,freq=1.0 = termFreq=1.0
), product of:
        71.28527 = queryWeight, product of:
          6.4 = boost
          11.138323 = idf(docFreq=1, maxDocs=50588)
          1.0 = queryNorm
        11.138323 = fieldWeight in 5132, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          11.138323 = idf(docFreq=1, maxDocs=50588)
          1.0 = fieldNorm(doc=5132)
    312.06528 = (MATCH) weight(content_en:groningen^1.6 in 5132) [], result of:
      312.06528 = score(doc=5132,freq=1.0 = termFreq=1.0
), product of:
        17.172573 = queryWeight, product of:
          1.6 = boost
          10.732858 = idf(docFreq=2, maxDocs=50588)
          1.0 = queryNorm
        18.172308 = fieldWeight in 5132, product of:
          1.6931472 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          10.732858 = idf(docFreq=2, maxDocs=50588)
          1.0 = fieldNorm(doc=5132)
    20.73867 = (MATCH) weight(domain_grams:groningen^3.7 in 5132) [], result of:
      20.73867 = score(doc=5132,freq=1.0 = termFreq=1.0
), product of:
        26.48697 = queryWeight, product of:
          3.7 = boost
          7.158641 = idf(docFreq=106, maxDocs=50588)
          1.0 = queryNorm
        0.7829763 = fieldWeight in 5132, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          7.158641 = idf(docFreq=106, maxDocs=50588)
          0.109375 = fieldNorm(doc=5132)
{code}
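
The top-level score in the snippet above can be checked by hand. The explain 
output describes it as the best field score plus 0.35 times the others. A 
minimal Python sketch of that arithmetic follows; the function name 
dismax_score is made up for illustration, and the numbers are copied from the 
explain output for doc 5132.

```python
def dismax_score(field_scores, tie):
    """Combine per-field scores as 'max plus tie times others'."""
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie * others


# Per-field scores for doc 5132, copied from the explain output:
# title_en, content_en, domain_grams.
scores = [793.99835, 312.06528, 20.73867]
total = dismax_score(scores, tie=0.35)

# Should agree with the reported 910.47974 up to float rounding.
print(round(total, 3))
```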

Here's the debug output for the same doc on the core to which I reindexed the same data:

{code}
928.31537 = (MATCH) sum of:
  928.31537 = (MATCH) max plus 0.35 times others of:
    815.29694 = (MATCH) weight(title_en:groningen^6.4 in 31881) [], result of:
      815.29694 = score(doc=31881,freq=1.0 = termFreq=1.0
), product of:
        72.23504 = queryWeight, product of:
          6.4 = boost
          11.286724 = idf(docFreq=1, maxDocs=58681)
          1.0 = queryNorm
        11.286724 = fieldWeight in 31881, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          11.286724 = idf(docFreq=1, maxDocs=58681)
          1.0 = fieldNorm(doc=31881)
    304.0185 = (MATCH) weight(content_en:groningen^1.6 in 31881) [], result of:
      304.0185 = score(doc=31881,freq=1.0 = termFreq=1.0
), product of:
        16.949724 = queryWeight, product of:
          1.6 = boost
          10.593577 = idf(docFreq=3, maxDocs=58681)
          1.0 = queryNorm
        17.936485 = fieldWeight in 31881, product of:
          1.6931472 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          10.593577 = idf(docFreq=3, maxDocs=58681)
          1.0 = fieldNorm(doc=31881)
    18.891369 = (MATCH) weight(domain_grams:groningen^3.7 in 31881) [], result of:
      18.891369 = score(doc=31881,freq=1.0 = termFreq=1.0
), product of:
        25.279795 = queryWeight, product of:
          3.7 = boost
          6.832377 = idf(docFreq=171, maxDocs=58681)
          1.0 = queryNorm
        0.7472912 = fieldWeight in 31881, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          6.832377 = idf(docFreq=171, maxDocs=58681)
          0.109375 = fieldNorm(doc=31881)
{code}

As you can see, docFreq has changed but the number of documents is still the 
same. Since I now suspect that segment merging has something to do with it, 
I'll send an optimize command to the node that I reindexed data to.
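
The idf values in both snippets are consistent with the classic Lucene 
formula idf = 1 + ln(maxDocs / (docFreq + 1)). That formula is inferred from 
the numbers here, not confirmed from the Similarity configured on these 
cores. Under that assumption, a short Python sketch reproduces every idf line 
from the docFreq and maxDocs printed next to it:

```python
import math


def classic_idf(doc_freq, max_docs):
    """Assumed formula: 1 + ln(maxDocs / (docFreq + 1))."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# (field, docFreq, maxDocs, idf as printed in the explain output)
reported = [
    ("title_en", 1, 50588, 11.138323),
    ("content_en", 2, 50588, 10.732858),
    ("domain_grams", 106, 50588, 7.158641),
    ("title_en", 1, 58681, 11.286724),
    ("content_en", 3, 58681, 10.593577),
    ("domain_grams", 171, 58681, 6.832377),
]

for name, doc_freq, max_docs, printed in reported:
    computed = classic_idf(doc_freq, max_docs)
    print(f"{name}: computed={computed:.6f} printed={printed}")
```

If the formula holds, the score difference between the two cores comes 
entirely from the changed docFreq and maxDocs inputs, not from the scoring 
code itself.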

After optimizing (that is, forcing all segments to be merged) I get the same 
debug output as I had for the first node, which I didn't reindex to!
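
This fits how Lucene handles updates, as far as I understand it: reindexing a 
document marks the old copy as deleted and writes a new one, and deleted 
copies keep counting towards docFreq and maxDoc until a merge drops them. 
Below is a toy model of that behaviour in Python. The Segment class, the 
helper functions and the sample data are all invented for illustration and 
are not Lucene's API.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Toy segment: doc id -> set of terms, plus ids marked as deleted."""
    docs: dict = field(default_factory=dict)
    deleted: set = field(default_factory=set)


def max_doc(segments):
    # Deleted docs still occupy a slot until they are merged away.
    return sum(len(s.docs) for s in segments)


def doc_freq(segments, term):
    # Postings are not rewritten on delete, so deleted docs still count.
    return sum(1 for s in segments for terms in s.docs.values() if term in terms)


def reindex(segments, new_docs):
    """Mark existing copies as deleted and append a new segment."""
    for seg in segments:
        seg.deleted |= seg.docs.keys() & new_docs.keys()
    segments.append(Segment(docs=dict(new_docs)))


def force_merge(segments):
    """Merge everything into one segment, dropping deleted docs."""
    merged = Segment()
    for seg in segments:
        for doc_id, terms in seg.docs.items():
            if doc_id not in seg.deleted:
                merged.docs[doc_id] = terms
    return [merged]


data = {"a": {"groningen"}, "b": {"groningen", "utrecht"}, "c": {"utrecht"}}
index = [Segment(docs=dict(data))]
before = (doc_freq(index, "groningen"), max_doc(index))

reindex(index, data)  # index the same data again
after_reindex = (doc_freq(index, "groningen"), max_doc(index))

index = force_merge(index)  # stands in for the optimize command
after_merge = (doc_freq(index, "groningen"), max_doc(index))

print(before, after_reindex, after_merge)
```

In this toy model both numbers double after the reindex because nothing is 
merged in between. On a real core background merges would already have 
removed part of the deleted copies, which would match the smaller increase in 
maxDocs seen above.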



                
> Inconsistent numDocs between leader/replica
> -------------------------------------------
>
>                 Key: SOLR-4260
>                 URL: https://issues.apache.org/jira/browse/SOLR-4260
>             Project: Solr
>          Issue Type: Bug
>          Components: SolrCloud
>    Affects Versions: 5.0
>         Environment: 5.0.0.2013.01.04.15.31.51
>            Reporter: Markus Jelsma
>            Priority: Critical
>             Fix For: 5.0
>
>
> After wiping all cores and reindexing some 3.3 million docs from Nutch using 
> CloudSolrServer we see inconsistencies between the leader and replica for 
> some shards.
> Each core holds about 3.3k documents. For some reason 5 out of 10 shards have 
> a small deviation in the number of documents. The leader and slave deviate 
> by roughly 10-20 documents, not more.
> Results hopping ranks in the result set for identical queries got my 
> attention: there were small IDF differences for exactly the same record, 
> causing a record to shift positions in the result set. During those tests no 
> records were indexed. Consecutive catch-all queries also return different 
> numDocs values.
> We're running a 10 node test cluster with 10 shards and a replication factor 
> of two and frequently reindex using a fresh build from trunk. I've not seen 
> this issue for quite some time until a few days ago.
