[ https://issues.apache.org/jira/browse/CASSANDRA-7734?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092169#comment-14092169 ]
graham sanderson commented on CASSANDRA-7734:
---------------------------------------------

Note this is not a problem _during_ the upgrade; it is a problem after the upgrade, with all nodes successfully on 2.0.9.

I'm a bit confused from a technical perspective, so I would welcome any comments from others who have been near this code: [~iamaleksey], [~jbellis]

I'm not sure about the lifecycle of IncomingTcpConnection, but there is code there (in the close method):
{code}
MessagingService.instance().resetVersion(from);
{code}
That unsets the (statically scoped) version for an endpoint when closing. I would assume there could be overlapping connections for an endpoint, so this seems undesirable?

Also:
{code}
MessagingService.instance().knowsVersion(endpoint) &&
MessagingService.instance().getRawVersion(endpoint) == MessagingService.current_version
{code}
Since the endpoint->version mapping is static, global and concurrent, the two lookups are not atomic with respect to each other, so we shouldn't be checking it twice.

Also, CASSANDRA-6700 changes:

     public boolean knowsVersion(InetAddress endpoint)
     {
    -    return versions.get(endpoint) != null;
    +    return versions.containsKey(endpoint);
     }

However, it is not clear that the map can ever contain a null value, and the getVersion() method still does the check the old way (versions.get(endpoint) != null).

In any case, I'm partly confused because I'm not quite sure how this endpoint version tracking is supposed to work, and the current state seems to have evolved as a result of lots of different issues (I don't think I've captured all of them here).

> Schema pushes (seemingly) randomly not happening
> ------------------------------------------------
>
>                 Key: CASSANDRA-7734
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-7734
>             Project: Cassandra
>          Issue Type: Bug
>            Reporter: graham sanderson
>
> We have been seeing problems since upgrading to 2.0.9 from 2.0.5.
> Basically, after a while, schema changes start propagating slowly from some nodes to others.
> It looks from the logs and trace that in this case the "push" of the schema never happens (note that once a node has decided not to push to another node, it doesn't seem to start again). In this case though, we do see the other node end up pulling the schema some time later, when it notices its schema is out of date.
> Here is the code from 2.0.9 MigrationManager.announce:
> {code}
>         for (InetAddress endpoint : Gossiper.instance.getLiveMembers())
>         {
>             // only push schema to nodes with known and equal versions
>             if (!endpoint.equals(FBUtilities.getBroadcastAddress()) &&
>                     MessagingService.instance().knowsVersion(endpoint) &&
>                     MessagingService.instance().getRawVersion(endpoint) == MessagingService.current_version)
>                 pushSchemaMutation(endpoint, schema);
>         }
> {code}
> and from 2.0.5:
> {code}
>         for (InetAddress endpoint : Gossiper.instance.getLiveMembers())
>         {
>             if (endpoint.equals(FBUtilities.getBroadcastAddress()))
>                 continue; // we've dealt with localhost already
>             // don't send schema to the nodes with the versions older than current major
>             if (MessagingService.instance().getVersion(endpoint) < MessagingService.current_version)
>                 continue;
>             pushSchemaMutation(endpoint, schema);
>         }
> {code}
> The old getVersion() call would return MessagingService.current_version if the version was unknown, so the push would occur in this case. I don't have logging to prove this, but I have a strong suspicion that the version may end up null in some cases (which would have allowed schema propagation in 2.0.5, but not in some release after that, up to and including 2.0.9).



--
This message was sent by Atlassian JIRA
(v6.2#6252)
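
For illustration only, the behavioural difference between the two announce() loops quoted above can be reproduced in isolation. This is a minimal sketch under stated assumptions, not Cassandra code: the class VersionTrackingSketch, its method names, the string endpoints and the value of CURRENT_VERSION are all hypothetical, and resetVersion() is modelled as simply removing the map entry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the endpoint -> version tracking discussed above.
// None of these names are Cassandra's; endpoints are plain strings for brevity.
class VersionTrackingSketch
{
    static final int CURRENT_VERSION = 7; // placeholder, not the real constant

    final ConcurrentMap<String, Integer> versions = new ConcurrentHashMap<>();

    // 2.0.5-style rule: an unknown version is treated as current,
    // so the push is skipped only for endpoints known to be older.
    boolean shouldPushOldStyle(String endpoint)
    {
        Integer v = versions.get(endpoint);
        int effective = (v == null) ? CURRENT_VERSION : v;
        return !(effective < CURRENT_VERSION);
    }

    // 2.0.9-style rule: push only when the version is known and equal.
    // The map is read once, so a concurrent removal cannot land
    // between a "knows" check and a separate "get".
    boolean shouldPushNewStyle(String endpoint)
    {
        Integer v = versions.get(endpoint);
        return v != null && v == CURRENT_VERSION;
    }

    public static void main(String[] args)
    {
        VersionTrackingSketch s = new VersionTrackingSketch();
        String peer = "10.0.0.2";

        s.versions.put(peer, CURRENT_VERSION);
        System.out.println(s.shouldPushNewStyle(peer)); // true

        // Assumption: closing a connection unsets the version by removing it.
        s.versions.remove(peer);
        System.out.println(s.shouldPushOldStyle(peer)); // true
        System.out.println(s.shouldPushNewStyle(peer)); // false
    }
}
```

Under these assumptions, once the entry for a live peer has been removed, the old rule still pushes to it while the new rule silently skips it, which matches the symptom described in the issue.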