[ https://issues.apache.org/jira/browse/SOLR-14159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010126#comment-17010126 ]
Chris M. Hostetter commented on SOLR-14159:
-------------------------------------------

{quote}I just attached the failure I've seen with "Timeout waiting for active collection".
{quote}
So at a glance, the file {{FailWaitingForCollection_WithHossFix}} seems to show the same underlying problem of the (socket) port refusing a connection. But instead of a replica refusing the connection (when checking if it has all the documents), it's the *Leader* that's refusing connections. That prevents the replicas from talking to it in order to come online and recover, so the {{"Timeout waiting for active collection"}} causes a failure.

Going back to an earlier comment...
{quote}
{quote}Recovery was successful.
{quote}
That does not appear to be true? ...
{quote}
I see what you mean now. The _other_ logs ({{FailWithHossFix}}) seemed to show that the leader couldn't _initiate_ the PeerSync recovery, but (evidently ... still not sure where) it did eventually happen (and I just didn't notice it in the logs), because the replica came online and considered itself active (hence the test made it past the failure we see here in {{FailWaitingForCollection_WithHossFix}}) ... but in this case neither replica is able to recover at all, because they were never able to "fetch" the data from the leader.

This smells like the same root problem as the other log: somehow/somewhere a Socket proxy isn't functioning properly after {{reopen()}}. In {{FailWithHossFix}} it's the proxy of a replica; in {{FailWaitingForCollection_WithHossFix}} it's the proxy of the Leader.

----

Definitely curious to see if _either_ type of failure reproduces with the patches I posted earlier.

> Fix errors in TestCloudConsistency
> ----------------------------------
>
>                 Key: SOLR-14159
>                 URL: https://issues.apache.org/jira/browse/SOLR-14159
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level.
Issues are Public)
>            Reporter: Erick Erickson
>            Assignee: Erick Erickson
>            Priority: Major
>         Attachments: FailWaitingForCollection_WithHossFix, FailWithHossFix, SOLR-14159_debug.patch, SOLR-14159_proxy_logging.patch, SOLR-14159_waitFor_testfix.patch, WithHossFix.patch, stdout
>
>
> Moving over here from SOLR-13486 as per Hoss.