[
https://issues.apache.org/jira/browse/SOLR-6969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14286485#comment-14286485
]
Mike Drob commented on SOLR-6969:
---------------------------------
Is retrying always going to be safe? That works fine after we've lost a server
and started a new one (albeit too quickly), but what about the case where two
servers both think they are responsible for the same tlog? This can happen if
the original server partially dies but still has some threads doing work that
haven't been cleaned up.
Looking at how other projects handle similar issues: HBase moves the entire
directory[1] to break any existing leases and ensure that any other process
gets kicked out. Maybe a retry is a good stop-gap, but is it going to be a
full solution?
[1]:
https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/MasterFileSystem.java#L310
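Roughly what I have in mind from the HBase side, sketched under the assumption
that the tlogs for a core live in their own directory on HDFS. The class and
method names below are made up for illustration, not anything in the Solr code
base:
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

public class TlogFence {
  // Rename the tlog directory before replay so any lingering writer from the
  // old process loses its path, then ask the NameNode to recover each file's
  // lease before we open anything for append.
  public static Path fenceAndRecover(FileSystem fs, Path tlogDir) throws IOException {
    Path fenced = new Path(tlogDir.getParent(), tlogDir.getName() + "-recovering");
    if (!fs.rename(tlogDir, fenced)) {
      throw new IOException("Could not rename " + tlogDir + " to fence out old writers");
    }
    if (fs instanceof DistributedFileSystem) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      for (FileStatus stat : dfs.listStatus(fenced)) {
        // recoverLease() returns true once the lease is released; a real
        // implementation would poll until it does or a timeout expires.
        dfs.recoverLease(stat.getPath());
      }
    }
    return fenced;
  }
}
{code}
The rename is the fencing step: once the old process can no longer find the
path, recoverLease() just cleans up whatever the NameNode still thinks is open.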
> Just like we have to retry when the NameNode is in safemode on Solr startup,
> we also need to retry when opening a transaction log file for append when we
> get a RecoveryInProgressException.
> --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: SOLR-6969
> URL: https://issues.apache.org/jira/browse/SOLR-6969
> Project: Solr
> Issue Type: Bug
> Components: hdfs
> Reporter: Mark Miller
> Assignee: Mark Miller
> Priority: Critical
> Fix For: 5.0, Trunk
>
>
> This can happen after a hard crash and restart. The current workaround is to
> stop, wait it out, and start again. Instead, we should retry and wait a given
> amount of time, as we already do when we detect safe mode.
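For comparison, a minimal sketch of the retry-and-wait idea from the
description, assuming we match on the exception text because
RecoveryInProgressException typically arrives wrapped in a RemoteException
(again, names and timeouts here are illustrative only):
{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TlogAppendRetry {
  // Keep retrying the append while the NameNode reports that lease recovery is
  // still in progress, the same way Solr already waits out safe mode on startup.
  public static FSDataOutputStream appendWithRetry(FileSystem fs, Path tlog,
      long timeoutMs, long pauseMs) throws IOException, InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (true) {
      try {
        return fs.append(tlog);
      } catch (IOException e) {
        // RecoveryInProgressException usually shows up wrapped in a
        // RemoteException, so match on the message rather than the type.
        boolean recovering = String.valueOf(e.getMessage())
            .contains("RecoveryInProgressException");
        if (!recovering || System.currentTimeMillis() > deadline) {
          throw e;
        }
        Thread.sleep(pauseMs);
      }
    }
  }
}
{code}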