[ 
https://issues.apache.org/jira/browse/SOLR-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010126#comment-17010126
 ] 

Chris M. Hostetter commented on SOLR-14159:
-------------------------------------------

{quote}I just attached the failure I've seen with "Timeout waiting for active 
collection".
{quote}
So at a glance, the file {{FailWaitingForCollection_WithHossFix}} seems to show 
the same underlying problem of the (socket) port refusing a connection, but 
instead of a replica refusing the connection (when checking if it has 
all the documents) it's the *Leader* that's refusing connections. That prevents 
the replicas from talking to it in order to recover and come online, so the 
test fails with {{"Timeout waiting for active collection"}}.

Going back to an earlier comment...
{quote}{quote}Recovery was successful.
{quote}
That does not appear to be true? ...
{quote}
I see what you mean now: the _other_ logs ({{FailWithHossFix}}) seemed to show 
that the leader couldn't _initiate_ the PeerSync recovery, but evidently it did 
eventually happen (I'm still not sure where, and I just didn't notice it in 
the logs) because the replica came online and considered itself active, which 
is why that run made it past the failure we see here in 
{{FailWaitingForCollection_WithHossFix}}. In this case, though, neither replica 
is able to recover at all because they were never able to "fetch" the data from 
the leader.

This smells like the same root problem as the other log: somehow/somewhere a 
Socket proxy isn't functioning properly after {{reopen()}}. In 
{{FailWithHossFix}} it's the proxy of a replica; in 
{{FailWaitingForCollection_WithHossFix}} it's the proxy of the Leader.
----
I'm definitely curious to see whether _either_ type of failure reproduces with 
the patches I posted earlier.

> Fix errors in TestCloudConsistency
> ----------------------------------
>
>                 Key: SOLR-14159
>                 URL: https://issues.apache.org/jira/browse/SOLR-14159
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: FailWaitingForCollection_WithHossFix, FailWithHossFix, 
> SOLR-14159_debug.patch, SOLR-14159_proxy_logging.patch, 
> SOLR-14159_waitFor_testfix.patch, WithHossFix.patch, stdout
>
>
> Moving over here from SOLR-13486 as per Hoss.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

