[jira] [Comment Edited] (SOLR-7134) Replication can still cause index corruption.

Mark Miller (JIRA) Wed, 25 Feb 2015 16:29:19 -0800

    [ 
https://issues.apache.org/jira/browse/SOLR-7134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337561#comment-14337561
 ]


Mark Miller edited comment on SOLR-7134 at 2/26/15 12:28 AM:
-------------------------------------------------------------

bq. Is it ok to kill the whole operation from inside of a debug block? 

From my perspective I'd rather that, as I'm more likely to notice there is an 
issue. Probably nicer for the end user to chug along though.

bq. Is this related, or just incidental test cleanup?

Rarely, shutting down hdfs can throw an exception: in some cases because it 
cannot find a class it wants, in other cases a null pointer exception. 
Problems starting and stopping hdfs in tests are outside of what we are 
testing, though, so they shouldn't randomly fail our chaosmonkey tests.
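
As a rough illustration of that idea (not the code in the attached patch), a 
test teardown could run the shutdown through a helper that records a failure 
instead of rethrowing it. The class name TeardownSketch and the method 
shutdownQuietly are hypothetical, and the Runnable stands in for whatever call 
actually stops the hdfs test cluster.

```java
public class TeardownSketch {

    /**
     * Runs a shutdown action and returns any failure instead of rethrowing it,
     * so that a flaky teardown does not fail the test that was really under test.
     * Hypothetical helper; the Runnable stands in for the real cluster shutdown.
     */
    static Throwable shutdownQuietly(Runnable shutdown) {
        try {
            shutdown.run();
            return null;
        } catch (Throwable t) {
            // Throwable rather than Exception: a missing class surfaces as
            // NoClassDefFoundError, which is an Error, not an Exception.
            System.err.println("Ignoring failure while shutting down test cluster: " + t);
            return t;
        }
    }
}
```

Catching Throwable is usually too broad for production code; it is only 
defensible here because the shutdown is test scaffolding rather than the 
behaviour being tested.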

bq. No reason to remove this constructor, I think.

I'd rather force devs to pay attention to the other params they use than offer 
more constructors here. 

bq. Why is there a difference?

Tests can last much too long if we have no pauses between updates and we 
allow too many updates. When there are pauses, it's not so bad, but the pauses 
can be so short (it's random) that we still want to have some upper limit. 
This is probably a result of log replay not being able to keep up with 
incoming updates.
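
A minimal sketch of the kind of cap described above, assuming an indexing 
loop that optionally sleeps a random interval between updates. The class 
name, the method names and both limit values are made up for illustration 
and are not taken from the patch.

```java
import java.util.Random;
import java.util.function.BooleanSupplier;

public class BoundedUpdatesSketch {

    // Hypothetical caps: a lower one when updates are sent back to back,
    // a higher one when random pauses slow the update rate down.
    static final int MAX_UPDATES_NO_PAUSES = 1_500;
    static final int MAX_UPDATES_WITH_PAUSES = 10_000;

    static int chooseMaxUpdates(boolean pauseBetweenUpdates) {
        return pauseBetweenUpdates ? MAX_UPDATES_WITH_PAUSES : MAX_UPDATES_NO_PAUSES;
    }

    /** Sends updates until asked to stop or until the cap is reached; returns the count. */
    static int runUpdates(Random random, boolean pauseBetweenUpdates, int maxPauseMs,
                          int maxUpdates, BooleanSupplier keepRunning)
            throws InterruptedException {
        int sent = 0;
        while (keepRunning.getAsBoolean() && sent < maxUpdates) {
            sent++; // stand-in for sending one update to the cluster
            if (pauseBetweenUpdates) {
                // The pause is random and may be 0 ms, hence the cap in both modes.
                Thread.sleep(random.nextInt(maxPauseMs + 1));
            }
        }
        return sent;
    }
}
```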

bq. Possibly worth logging which core?

It's always worth doing this everywhere, but since we do it so little already 
(except when you use our special test logger, which tries to figure it out), 
I've been waiting for a more holistic fix rather than doing ad hoc fixes. No real 
pain adding it here though.


> Replication can still cause index corruption.
> ---------------------------------------------
>
>                 Key: SOLR-7134
>                 URL: https://issues.apache.org/jira/browse/SOLR-7134
>             Project: Solr
>          Issue Type: Bug
>          Components: replication (java)
>            Reporter: Mark Miller
>            Assignee: Mark Miller
>            Priority: Critical
>             Fix For: Trunk, 5.1
>
>         Attachments: SOLR-7134.patch, SOLR-7134.patch
>
>
> While we have plugged most of these holes, there appears to be another that 
> is fairly rare.
> I've seen it play out a couple ways in tests, but it looks like part of the 
> problem is that even if we decide we need a file and download it, we don't 
> care if we then cannot move it into place if it already exists.
> I'm working with a fix that does two things:
> * Fail a replication attempt if we cannot move a file into place because it 
> already exists.
> * If a replication attempt during recovery fails, on the next attempt force a 
> full replication to a new directory.
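
The two points in the issue description can be sketched roughly as below. 
This is an illustration under stated assumptions, not Solr's replication 
code: ReplicationSketch, moveIntoPlace, replicate and the "index." directory 
prefix are all hypothetical names, and a real full replication would fetch 
every file again rather than reuse a list of already downloaded ones.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ReplicationSketch {

    // Set after a failed attempt so that the next attempt uses a new directory.
    private boolean forceFull = false;

    /** Moves a downloaded file into targetDir, failing if the name is already taken. */
    static void moveIntoPlace(Path downloaded, Path targetDir) throws IOException {
        Path target = targetDir.resolve(downloaded.getFileName());
        if (Files.exists(target)) {
            // Point 1: an existing file is a failure, not something to ignore.
            throw new FileAlreadyExistsException(target.toString());
        }
        Files.move(downloaded, target);
    }

    /** Returns the directory that received the files, or null if the attempt failed. */
    Path replicate(List<Path> downloaded, Path indexDir) {
        try {
            Path dest = indexDir;
            if (forceFull) {
                // Point 2: after a failure, replicate into a brand-new directory.
                dest = Files.createTempDirectory(indexDir.toAbsolutePath().getParent(), "index.");
            }
            for (Path file : downloaded) {
                moveIntoPlace(file, dest);
            }
            forceFull = false;
            return dest;
        } catch (IOException e) {
            forceFull = true;
            return null;
        }
    }

    boolean willForceFull() {
        return forceFull;
    }
}
```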



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

