[jira] [Commented] (IGNITE-8633) Node fails to bail out of wrong BLT, instead hanging around indefinitely

Sergey Chugunov (JIRA) Wed, 30 May 2018 06:46:06 -0700


    [ 
https://issues.apache.org/jira/browse/IGNITE-8633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16495180#comment-16495180
 ]


Sergey Chugunov commented on IGNITE-8633:
-----------------------------------------

Hi [~ilyak], 

I tried to reproduce this behavior with the testing framework and everything 
worked fine: both nodes A and C were rejected when trying to join B because of 
a BaselineTopology inconsistency.

The attached logs make me think that discovery never reached the BLT checks 
but got stuck at some earlier point. Could you please turn on debug logging 
for the TCP discovery package (org.apache.ignite.spi.discovery.tcp) and run 
the test in your environment once again?
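
For reference, here is a minimal sketch of one way to raise the log level for 
that package. It assumes the node logs through java.util.logging; if the node 
is configured with log4j or another logger, the equivalent change belongs in 
that logger's configuration instead. The class and method names 
(DiscoveryDebugLogging, enableDiscoveryDebug) are hypothetical and used only 
for illustration.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class DiscoveryDebugLogging {
    /** Package whose debug output is requested in the comment above. */
    static final String TCP_DISCOVERY_PKG = "org.apache.ignite.spi.discovery.tcp";

    // java.util.logging holds loggers weakly, so keep a strong reference;
    // otherwise the configured level could be lost after garbage collection.
    private static final Logger DISCO_LOGGER = Logger.getLogger(TCP_DISCOVERY_PKG);

    /** Lowers the threshold for the discovery package to FINE (debug). */
    static Logger enableDiscoveryDebug() {
        DISCO_LOGGER.setLevel(Level.FINE);

        // Handlers filter by their own level (INFO by default for the console),
        // so attach one that lets FINE records through.
        Handler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        DISCO_LOGGER.addHandler(handler);

        // Assumption: avoid printing INFO and above twice via root handlers.
        // Drop this line if root handlers (e.g. a file handler) must still
        // receive these records.
        DISCO_LOGGER.setUseParentHandlers(false);

        return DISCO_LOGGER;
    }

    public static void main(String[] args) {
        Logger log = enableDiscoveryDebug();
        log.fine("TCP discovery debug logging enabled");
    }
}
```

The method would have to run in the node's JVM before the node is started, so 
that the join attempts themselves are logged at debug level.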

> Node fails to bail out of wrong BLT, instead hanging around indefinitely
> ------------------------------------------------------------------------
>
>                 Key: IGNITE-8633
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8633
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.4
>            Reporter: Ilya Kasnacheev
>            Assignee: Sergey Chugunov
>            Priority: Major
>         Attachments: 8633.zip
>
>
> Follow-up on 
> https://stackoverflow.com/questions/50234056/how-to-give-multiple-static-ip-in-apache-ignite-cache-configuration-xml-file/50270676?noredirect=1#comment88095814_50270676
>  but not quite the same.
> I have three nodes: A, B and C.
> I've started A and C and performed activation.
> Then I stopped them both, started B and performed activation on it.
> Now I have two BLT clusters: (A, C) and (B).
> However, when I start B and then try to launch node A or C, I get 
> inconsistent behavior:
> When I launch C, I get the error:
> {code}
> org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node 
> (8c1e210f-52bb-424f-9c7c-a2e7b1bab546 ) is not compatible with 
> BaselineTopology in the cluster. Branching history of cluster BlT 
> ([-1349069127]) doesn't contain branching point hash of joining node BlT 
> (631694798). Consider cleaning persistent storage of the node and adding it 
> to the cluster again.
> {code}
> But when I launch A, it never enters the topology, yet it never fails 
> either. Moreover, A and B keep pinging each other indefinitely:
> {code}
> [20:16:38,596][WARNING][main][TcpDiscoverySpi] Node has not been connected to 
> topology and will repeat join process. Check remote nodes logs for possible 
> error messages. Note that large topology may require significant time to 
> start. Increase 'TcpDiscoverySpi.networkTimeout' configuration property if 
> getting this message on the starting nodes [networkTimeout=5000]
> [20:17:29,514][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery 
> accepted incoming connection [rmtAddr=/172.25.1.36, rmtPort=49030]
> [20:17:29,522][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery 
> spawning a new thread for connection [rmtAddr=/172.25.1.36, rmtPort=49030]
> [20:17:29,523][INFO][tcp-disco-sock-reader-#26][TcpDiscoverySpi] Started 
> serving remote node connection [rmtAddr=/172.25.1.36:49030, rmtPort=49030]
> [20:17:29,524][INFO][tcp-disco-sock-reader-#26][TcpDiscoverySpi] Received 
> ping request from the remote node 
> [rmtNodeId=37104137-a21e-4b6f-a70b-09164300bbfc, rmtAddr=/172.25.1.36:49030, 
> rmtPort=49030]
> [20:17:29,525][INFO][tcp-disco-sock-reader-#26][TcpDiscoverySpi] Finished 
> writing ping response [rmtNodeId=37104137-a21e-4b6f-a70b-09164300bbfc, 
> rmtAddr=/172.25.1.36:49030, rmtPort=49030]
> [20:17:29,526][INFO][tcp-disco-sock-reader-#26][TcpDiscoverySpi] Finished 
> serving remote node connection [rmtAddr=/172.25.1.36:49030, rmtPort=49030
> [20:18:30,733][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery 
> accepted incoming connection [rmtAddr=/172.25.1.36, rmtPort=50857]
> [20:18:30,733][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery 
> spawning a new thread for connection [rmtAddr=/172.25.1.36, rmtPort=50857]
> [20:18:30,733][INFO][tcp-disco-sock-reader-#47][TcpDiscoverySpi] Started 
> serving remote node connection [rmtAddr=/172.25.1.36:50857, rmtPort=50857]
> [20:18:30,734][INFO][tcp-disco-sock-reader-#47][TcpDiscoverySpi] Received 
> ping request from the remote node 
> [rmtNodeId=37104137-a21e-4b6f-a70b-09164300bbfc, rmtAddr=/172.25.1.36:50857, 
> rmtPort=50857]
> [20:18:30,734][INFO][tcp-disco-sock-reader-#47][TcpDiscoverySpi] Finished 
> writing ping response [rmtNodeId=37104137-a21e-4b6f-a70b-09164300bbfc, 
> rmtAddr=/172.25.1.36:50857, rmtPort=50857]
> [20:18:30,734][INFO][tcp-disco-sock-reader-#47][TcpDiscoverySpi] Finished 
> serving remote node connection [rmtAddr=/172.25.1.36:50857, rmtPort=50857
> {code}
> {code}
> [20:16:28,793][INFO][tcp-disco-msg-worker-#3][GridSnapshotAwareClusterStateProcessorImpl]
>  Received state change finish message: true
> [20:16:28,803][INFO][exchange-worker-#62][time] Finished exchange init 
> [topVer=AffinityTopologyVersion [topVer=1, minorTopVer=1], crd=true]
> [20:16:28,812][INFO][exchange-worker-#62][GridCachePartitionExchangeManager] 
> Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion 
> [topVer=1, minorTopVer=1], evt=DISCOVERY_CUSTOM_EVT, 
> node=37104137-a21e-4b6f-a70b-09164300bbfc]
> [20:16:28,818][INFO][sys-#68][GridSnapshotAwareClusterStateProcessorImpl] 
> Successfully performed final activation steps 
> [nodeId=37104137-a21e-4b6f-a70b-09164300bbfc, client=false, 
> topVer=AffinityTopologyVersion [topVer=1, minorTopVer=1]]
> [20:16:33,571][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery 
> accepted incoming connection [rmtAddr=/172.25.1.35, rmtPort=42500]
> [20:16:33,579][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery 
> spawning a new thread for connection [rmtAddr=/172.25.1.35, rmtPort=42500]
> [20:16:33,580][INFO][tcp-disco-sock-reader-#9][TcpDiscoverySpi] Started 
> serving remote node connection [rmtAddr=/172.25.1.35:42500, rmtPort=42500]
> [20:16:33,592][INFO][tcp-disco-sock-reader-#9][TcpDiscoverySpi] Finished 
> serving remote node connection [rmtAddr=/172.25.1.35:42500, rmtPort=42500
> [20:16:39,801][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery 
> accepted incoming connection [rmtAddr=/172.25.1.35, rmtPort=42714]
> [20:16:39,801][INFO][tcp-disco-srvr-#2][TcpDiscoverySpi] TCP discovery 
> spawning a new thread for connection [rmtAddr=/172.25.1.35, rmtPort=42714]
> [20:16:39,802][INFO][tcp-disco-sock-reader-#10][TcpDiscoverySpi] Started 
> serving remote node connection [rmtAddr=/172.25.1.35:42714, rmtPort=42714]
> [20:16:39,806][INFO][tcp-disco-sock-reader-#10][TcpDiscoverySpi] Finished 
> serving remote node connection [rmtAddr=/172.25.1.35:42714, rmtPort=42714
> {code}
> I don't think this is expected behaviour. I will attach config and work 
> directories.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
