[ 
https://issues.apache.org/jira/browse/SOLR-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142215#comment-15142215
 ] 

Yonik Seeley edited comment on SOLR-8586 at 2/11/16 4:22 AM:
-------------------------------------------------------------

bq. Yep, I've been looping a custom version of the HDFS-nothing-safe test that 
among other things, only does adds, no deletes.

Update: when I reverted my custom changes to the chaos test (so that it also 
did deletes), I got a high number of shard-out-of-sync errors... seemingly even 
more than before, so I've been trying to track those down.  What I saw were 
issues that did not look related to PeerSync... I saw missing documents from a 
shard that replicated from the leader while buffering documents, and I saw the 
missing documents come in and get buffered, pointing to transaction log 
buffering or replay issues.

Then I realized that I had tested "adds only" before committing, and tested the 
normal test after committing and doing a "git pull".  In between those times 
was SOLR-8575, which was a fix to the HDFS tlog!  I've been looping the test 
for a number of hours with those changes reverted, and I haven't seen a 
shards-out-of-sync failure so far.  I've also done a quick review of SOLR-8575, 
but didn't see anything obviously incorrect.  The changes in that issue may 
just be uncovering another bug (due to timing) rather than causing one... too 
early to tell.

I've also been running the non-hdfs version of the test for over a day, and 
have also seen no inconsistent shard failures.


> Implement hash over all documents to check for shard synchronization
> --------------------------------------------------------------------
>
>                 Key: SOLR-8586
>                 URL: https://issues.apache.org/jira/browse/SOLR-8586
>             Project: Solr
>          Issue Type: Improvement
>          Components: SolrCloud
>            Reporter: Yonik Seeley
>             Fix For: 5.5, master
>
>         Attachments: SOLR-8586.patch, SOLR-8586.patch, SOLR-8586.patch, 
> SOLR-8586.patch
>
>
> An order-independent hash across all of the versions in the index should 
> suffice.  The hash itself is pretty easy, but we need to figure out 
> when/where to do this check (for example, I think PeerSync is currently used 
> in multiple contexts and this check would perhaps not be appropriate for all 
> PeerSync calls?)
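
As a rough illustration of the "order-independent hash across all of the
versions" idea in the description, here is a minimal sketch. It is not taken
from the attached patches: the class and method names (VersionHashSketch,
mix64, hashVersions) are hypothetical, and it assumes each document's version
is available as a long. Each version is scrambled with the MurmurHash3 64-bit
finalizer and the results are summed; since addition is commutative, the
order in which a replica enumerates its documents does not affect the result.

```java
/** Illustrative sketch only; names are hypothetical, not Solr's API. */
final class VersionHashSketch {

    /** MurmurHash3 64-bit finalizer, used to scramble each version value. */
    static long mix64(long k) {
        k ^= k >>> 33;
        k *= 0xff51afd7ed558ccdL;
        k ^= k >>> 33;
        k *= 0xc4ceb9fe1a85ec53L;
        k ^= k >>> 33;
        return k;
    }

    /**
     * Sums the scrambled versions. Addition commutes, so any iteration
     * order over the same set of versions yields the same value.
     * Overflow simply wraps around, which is fine for a hash.
     */
    static long hashVersions(long[] versions) {
        long sum = 0L;
        for (long v : versions) {
            sum += mix64(v);
        }
        return sum;
    }
}
```

Two replicas would each compute hashVersions over their own versions and
compare the results; a mismatch indicates the shards are out of sync, while a
match is only strong evidence of agreement, not proof. A sum is used here
rather than XOR so that a version appearing twice does not cancel itself out.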


