[
https://issues.apache.org/jira/browse/SOLR-8586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15142215#comment-15142215
]
Yonik Seeley edited comment on SOLR-8586 at 2/11/16 4:22 AM:
-------------------------------------------------------------
bq. Yep, I've been looping a custom version of the HDFS-nothing-safe test that,
among other things, only does adds, no deletes.
Update: when I reverted my custom changes to the chaos test (so that it also
did deletes), I got a high number of shard-out-of-sync errors... seemingly even
more than before, so I've been trying to track those down. What I saw were
issues that did not look related to PeerSync... I saw documents missing from a
shard that replicated from the leader while buffering documents, and I saw the
missing documents come in and get buffered, which points to transaction log
buffering or replay issues.
Then I realized that I had tested "adds only" before committing, and tested the
normal test after committing and doing a "git pull". In between those two runs
came SOLR-8575, which was a fix to the HDFS tlog! I've been looping the test
for a number of hours with those changes reverted, and I haven't seen a
shards-out-of-sync failure so far. I've also done a quick review of SOLR-8575,
but didn't see anything obviously incorrect. The changes in that issue may
just be uncovering another bug (due to timing) rather than causing one... too
early to tell.
I've also been running the non-hdfs version of the test for over a day, and
have seen no inconsistent-shard failures there either.
> Implement hash over all documents to check for shard synchronization
> --------------------------------------------------------------------
>
> Key: SOLR-8586
> URL: https://issues.apache.org/jira/browse/SOLR-8586
> Project: Solr
> Issue Type: Improvement
> Components: SolrCloud
> Reporter: Yonik Seeley
> Fix For: 5.5, master
>
> Attachments: SOLR-8586.patch, SOLR-8586.patch, SOLR-8586.patch,
> SOLR-8586.patch
>
>
> An order-independent hash across all of the versions in the index should
> suffice. The hash itself is pretty easy, but we need to figure out
> when/where to do this check (for example, I think PeerSync is currently used
> in multiple contexts and this check would perhaps not be appropriate for all
> PeerSync calls?)
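
For illustration only, the order-independent hash described above could be sketched roughly as follows. This is not taken from the attached patches: the class and method names are hypothetical, and the choice of summing a mixed 64-bit hash of each version is an assumption made for the example. Addition is commutative, so the result does not depend on the order in which versions are visited.

```java
/**
 * Hypothetical sketch: an order-independent fingerprint over document
 * versions. Names and structure are illustrative assumptions, not the
 * code from the SOLR-8586 patches.
 */
final class VersionFingerprint {

    private VersionFingerprint() {}

    /** 64-bit finalizer from MurmurHash3 (fmix64), used to scramble each version. */
    static long mix(long h) {
        h ^= h >>> 33;
        h *= 0xff51afd7ed558ccdL;
        h ^= h >>> 33;
        h *= 0xc4ceb9fe1a85ec53L;
        h ^= h >>> 33;
        return h;
    }

    /**
     * Sums the mixed hash of every version. Overflow wraps around, which is
     * fine: wrapping addition is still commutative and associative, so any
     * iteration order yields the same value.
     */
    static long hashVersions(long[] versions) {
        long sum = 0L;
        for (long v : versions) {
            sum += mix(v);
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] leader = {101L, 202L, 303L};
        long[] replica = {303L, 101L, 202L}; // same versions, different order
        long[] stale = {101L, 202L};         // one version missing

        System.out.println("replica matches leader: "
                + (hashVersions(leader) == hashVersions(replica)));
        System.out.println("stale matches leader:   "
                + (hashVersions(leader) == hashVersions(stale)));
    }
}
```

Two shards holding the same set of versions produce the same value regardless of segment or iteration order, while a missing version changes the sum. A matching value is strong evidence of agreement rather than proof, since distinct sets can in principle collide.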