bootstrap failure and strange gossiper state
I am also experiencing issues bootstrapping new nodes in my 2.0.10 Cassandra cluster. The first attempt to bootstrap ALWAYS fails, followed by a second bootstrap attempt that ALWAYS succeeds. The first attempt at bootstrapping fails with:

INFO [main] 2015-03-15 02:41:02,550 StorageService.java (line 966) JOINING: Starting to bootstrap...
ERROR [main] 2015-03-15 02:41:02,872 CassandraDaemon.java (line 513) Exception encountered during startup
java.lang.IllegalStateException: unable to find sufficient sources for streaming range (7169067280919608187,7171404468239785904]
        at org.apache.cassandra.dht.RangeStreamer.getRangeFetchMap(RangeStreamer.java:201)
        at org.apache.cassandra.dht.RangeStreamer.addRanges(RangeStreamer.java:125)
        at org.apache.cassandra.dht.BootStrapper.bootstrap(BootStrapper.java:72)
        at org.apache.cassandra.service.StorageService.bootstrap(StorageService.java:994)
        at org.apache.cassandra.service.StorageService.joinTokenRing(StorageService.java:797)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:612)
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:502)
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:378)
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:496)
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:585)

After the failure, I:
1. stop the node,
2. clear out the data/saved_caches/commitlog directories,
3. remove the node from all peers (usually by manually deleting the node from their peers table), and
4. restart the node to re-attempt the bootstrap.

The bootstrap always seems to work on this second attempt.
I tried comparing the logs from the failed bootstrap and the successful one, and the main difference I see is that the failed bootstrap contains many "unknown endpoint" lines:

INFO [main] 2015-03-15 02:40:25,160 StorageService.java (line 966) JOINING: waiting for ring information
INFO [HANDSHAKE-/10.30.30.30] 2015-03-15 02:40:25,175 OutboundTcpConnection.java (line 386) Handshaking version with /10.30.30.30
INFO [HANDSHAKE-/10.4.4.4] 2015-03-15 02:40:25,279 OutboundTcpConnection.java (line 386) Handshaking version with /10.4.4.4
INFO [HANDSHAKE-/10.20.20.20] 2015-03-15 02:40:25,383 OutboundTcpConnection.java (line 386) Handshaking version with /10.20.20.20
INFO [HANDSHAKE-/10.10.10.10] 2015-03-15 02:40:25,489 OutboundTcpConnection.java (line 386) Handshaking version with /10.10.10.10
INFO [HANDSHAKE-/10.5.5.5] 2015-03-15 02:40:25,596 OutboundTcpConnection.java (line 386) Handshaking version with /10.5.5.5
INFO [RequestResponseStage:3] 2015-03-15 02:40:25,700 Gossiper.java (line 876) InetAddress /10.2.2.2 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,701 FailureDetector.java (line 200) unknown endpoint /10.2.2.2
ERROR [MigrationStage:1] 2015-03-15 02:40:25,701 MigrationTask.java (line 55) Can't send migration request: node /10.2.2.2 is down.
INFO [RequestResponseStage:4] 2015-03-15 02:40:25,716 Gossiper.java (line 876) InetAddress /10.1.1.1 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,716 FailureDetector.java (line 200) unknown endpoint /10.1.1.1
ERROR [MigrationStage:1] 2015-03-15 02:40:25,716 MigrationTask.java (line 55) Can't send migration request: node /10.1.1.1 is down.
INFO [RequestResponseStage:1] 2015-03-15 02:40:25,719 Gossiper.java (line 876) InetAddress /10.3.3.3 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,720 FailureDetector.java (line 200) unknown endpoint /10.3.3.3
ERROR [MigrationStage:1] 2015-03-15 02:40:25,720 MigrationTask.java (line 55) Can't send migration request: node /10.3.3.3 is down.
INFO [RequestResponseStage:2] 2015-03-15 02:40:25,739 Gossiper.java (line 876) InetAddress /10.4.4.4 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,739 FailureDetector.java (line 200) unknown endpoint /10.4.4.4
ERROR [MigrationStage:1] 2015-03-15 02:40:25,740 MigrationTask.java (line 55) Can't send migration request: node /10.4.4.4 is down.
INFO [RequestResponseStage:3] 2015-03-15 02:40:25,742 Gossiper.java (line 876) InetAddress /10.30.30.30 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,743 FailureDetector.java (line 200) unknown endpoint /10.30.30.30
ERROR [MigrationStage:1] 2015-03-15 02:40:25,743 MigrationTask.java (line 55) Can't send migration request: node /10.30.30.30 is down.
INFO [RequestResponseStage:4] 2015-03-15 02:40:25,747 Gossiper.java (line 876) InetAddress /10.20.20.20 is now UP
ERROR [MigrationStage:1] 2015-03-15 02:40:25,747 FailureDetector.java (line 200) unknown endpoint /10.20.20.20
ERROR [MigrationStage:1] 2015-03-15 02:40:25,748 MigrationTask.java (line 55) Can't send migration request: node /10.20.20.20 is down.
INFO [RequestResponseStage:1] 2015-03-15 02:40:25,823 Gossiper.java (line 876) InetAddress /10.5.5.5 is now UP
ERROR [MigrationStage:1] 2015-03-15
Re: Cassandra metrics Graphite
This seemed to be due to a bug with how metric names are converted to file system paths. os.path.join() is used, but the metric path converts into an absolute path (e.g. /org/apache/cassandra). This means you end up doing something like:

    os.path.join('/opt/graphite/storage/whatever', '/org/apache/cassandra/etc')

(the metric name gets converted to a path by replacing all dots with slashes). I just manually tweaked the Python code to strip any leading dots from the metric name as a temporary workaround.

-Karl

On Dec 17, 2014, at 11:04 AM, Nigel LEACH nigel.le...@uk.bnpparibas.com wrote:

I'm running Cassandra 2.0.11.83 (via DSE 4.6.0), and Graphite 0.9.10. I know a bit about Cassandra, but not much about Graphite. Our Graphite server exposes system metrics, and also those from the example Python scripts, successfully. I can see Cassandra metrics hitting the Graphite server, but in the console log, errors suggest they are attempting to load into the root file system:

    exceptions.IOError: [Errno 2] No such file or directory: '/org/apache/cassandra/metrics/ColumnFamily/system/sstable_activity/WriteTotalLatency/count.wsp'

whereas, I think, it should be going to something like this:

    /var/lib/carbon/whisper/carbon/agents/org/apache/cassandra/metrics/ColumnFamily/system/sstable_activity/WriteTotalLatency/count.wsp

I'm losing the prefix directory path somewhere, but don't know where to configure it. On the Cassandra side all I have added is a call to metricsGraphite.yaml, which contains:

    graphite:
      - period: 60
        timeunit: 'SECONDS'
        hosts:
          - host: '10.11.12.13'
            port: 2003
        predicate:
          color: white
          useQualifiedName: true
          patterns:
            - ^org.apache.cassandra.metrics.+

On the Graphite side I simply have the following in Carbon's storage-schemas.conf file:

    [cassandra]
    pattern = cassandra
    retentions = 60:90d

Any hints as to what is going wrong?

Many Thanks
Nigel
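Karl's os.path.join() diagnosis above can be reproduced in a few lines. This is only an illustrative sketch (the paths and metric name are made up, and this is not carbon's actual code) of why the storage prefix gets dropped, and of the strip-the-leading-separator idea behind his workaround:

```python
import os.path

# os.path.join discards every earlier component once a later component is
# absolute, which is why the Whisper file ends up under /org/... instead of
# under the Graphite storage directory.
metric = 'org.apache.cassandra.metrics.ColumnFamily.ReadLatency.count'
rel = metric.replace('.', '/') + '.wsp'   # carbon-style dot-to-slash conversion

bad = os.path.join('/opt/graphite/storage/whisper', '/' + rel)
# bad == '/org/apache/cassandra/metrics/ColumnFamily/ReadLatency/count.wsp'

# The workaround idea: strip any leading separator (or leading dots in the
# metric name, as in Karl's tweak) before joining.
good = os.path.join('/opt/graphite/storage/whisper', ('/' + rel).lstrip('/'))
# good keeps the storage prefix.

print(bad)
print(good)
```

Stripping the leading separator makes the second argument relative, so os.path.join keeps the storage prefix instead of silently replacing it.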
2.0.10 debian/ubuntu apt release?
Hi,

Wondering when 2.0.10 will be available through the DataStax apt repository?

-Karl
Re: 2.0.10 debian/ubuntu apt release?
Awesome! Thanks!

-Karl

On Sep 12, 2014, at 5:34 PM, Michael Shuler mich...@pbandjelly.org wrote:

On 09/12/2014 01:50 PM, Karl Rieb wrote:
> Hi, Wondering when 2.0.10 will be available through the datastax apt repository?

I'll have 2.0.10 deb/rpm packages in the repos on Monday, barring any issues. You can certainly pull the identical cassandra deb package from the Apache apt repository. Thanks for your patience!

http://www.apache.org/dist/cassandra/debian/pool/main/c/cassandra/cassandra_2.0.10_all.deb

sources.list entry:

    deb http://www.apache.org/dist/cassandra/debian 20x main

Apache Cassandra apt repo key instructions are here: http://wiki.apache.org/cassandra/DebianPackaging

--
Kind regards,
Michael
Re: DataType protocol ID error for TIMESTAMPs when upgrading from 1.2.11 to 2.0.9
I did not include unit tests in my patch. I think many people did not run into this issue because many Cassandra clients handle DateType when it is found as a CUSTOM type.

-Karl

On Jul 21, 2014, at 8:26 PM, Robert Coli rc...@eventbrite.com wrote:

On Mon, Jul 21, 2014 at 1:58 AM, Ben Hood 0x6e6...@gmail.com wrote:
On Sat, Jul 19, 2014 at 7:35 PM, Karl Rieb karl.r...@gmail.com wrote:
> Can now be followed at: https://issues.apache.org/jira/browse/CASSANDRA-7576.

Nice work! Finally we have a proper solution to this issue, so well done to you.

For reference, I consider this issue of sufficient severity to recommend against upgrading to any version of 2.0 before 2.0.10, unless you are certain you have no such schema. I'm pretty sure reversed comparator timestamps are a common type of schema, given that there are blog posts recommending their use, so I struggle to understand how this was not detected by unit tests. Does your fix add unit tests which would catch this case on upgrade?

=Rob
Re: DataType protocol ID error for TIMESTAMPs when upgrading from 1.2.11 to 2.0.9
Ben! I think I have an idea of exactly where the bug is! I did some more searching and discovered the difference that causes some tables to produce the wrong type and others to be okay: *the tables with the wrong type reverse the ordering of the timestamp column*.

The bug is in org.apache.cassandra.transport.DataType.fromType(AbstractType):

    public static Pair<DataType, Object> fromType(AbstractType type)
    {
        // For CQL3 clients, ReversedType is an implementation detail and they
        // shouldn't have to care about it.
        if (type instanceof ReversedType)
            type = ((ReversedType)type).baseType;
        // For compatibility sake, we still return DateType as the timestamp
        // type in resultSet metadata (#5723)
        else if (type instanceof DateType)
            type = TimestampType.instance;

        DataType dt = dataTypeMap.get(type);
        if (dt == null)
        {
            if (type.isCollection())
            {
                if (type instanceof ListType)
                {
                    return Pair.<DataType, Object>create(LIST, ((ListType)type).elements);
                }
                else if (type instanceof MapType)
                {
                    MapType mt = (MapType)type;
                    return Pair.<DataType, Object>create(MAP, Arrays.asList(mt.keys, mt.values));
                }
                else
                {
                    assert type instanceof SetType;
                    return Pair.<DataType, Object>create(SET, ((SetType)type).elements);
                }
            }
            return Pair.<DataType, Object>create(CUSTOM, type.toString());
        }
        else
        {
            return Pair.create(dt, null);
        }
    }

The issue is the *else if*, which does not check the base type of the reversed column:

    if (type instanceof ReversedType)
        type = ((ReversedType)type).baseType;
    // For compatibility sake, we still return DateType as the timestamp type in resultSet metadata (#5723)
    else if (type instanceof DateType)
        type = TimestampType.instance;

The else should be removed to make it just:

    if (type instanceof ReversedType)
        type = ((ReversedType)type).baseType;
    // For compatibility sake, we still return DateType as the timestamp type in resultSet metadata (#5723)
    if (type instanceof DateType)
        type = TimestampType.instance;

This way we do the check for DateType on the base type of reversed columns!
I applied the fix to my 2.0.9 cassandra node and the errors go away! Could you guys please make this single-word fix?

-Karl

On Fri, Jul 18, 2014 at 1:30 PM, Ben Hood 0x6e6...@gmail.com wrote:

On Fri, Jul 18, 2014 at 3:03 PM, Karl Rieb karl.r...@gmail.com wrote:
> Why is the protocol ID correct for some tables but not others?

I have no idea.

> Why does it work when I do a clean install on a new 2.0.x cluster?

I still have no idea.

> The bug seems to be on the Cassandra side and the clients seem to just be providing patches to these issues.

It was reported to the Cassandra list, but there was no answer, potentially because the query was sent to the wrong list, but I don't really know. Maybe it should have gone into Jira, but it's unclear as to whether this is a client or a server issue. In any case, it didn't look like the server behavior was going to change any time soon, so we just took the pragmatic approach in gocql and worked around the issue.

> I will post to the Datastax java driver mailing list and see if they are willing to add a patch.

That sounds like a good idea, seeing as the workaround has been tested before. Sorry to be of little help to you.
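For readers following along, the else-if bug is easy to see in miniature. The following is a hypothetical Python analogue of the dispatch logic in Karl's message above; it is not Cassandra code, and the class names only mirror Cassandra's marshal types:

```python
# Hypothetical Python analogue of fromType()'s type normalization -- NOT
# Cassandra code; the class names only mirror Cassandra's marshal types.

class DateType:
    pass

class TimestampType:
    pass

class ReversedType:
    def __init__(self, base_type):
        self.base_type = base_type

def normalize_buggy(t):
    # Mirrors the 2.0.9 code: because of the 'elif', a ReversedType wrapping
    # a DateType is unwrapped but never converted to TimestampType, so it
    # later falls through to the CUSTOM branch (protocol ID 0).
    if isinstance(t, ReversedType):
        t = t.base_type
    elif isinstance(t, DateType):
        t = TimestampType()
    return t

def normalize_fixed(t):
    # The one-word fix: a second, independent 'if', so the unwrapped base
    # type also gets the DateType -> TimestampType conversion.
    if isinstance(t, ReversedType):
        t = t.base_type
    if isinstance(t, DateType):
        t = TimestampType()
    return t

reversed_ts = ReversedType(DateType())
print(type(normalize_buggy(reversed_ts)).__name__)  # prints DateType
print(type(normalize_fixed(reversed_ts)).__name__)  # prints TimestampType
```

A plain (non-reversed) DateType column is converted correctly by both versions, which matches the observed symptom that only tables with reversed timestamp clustering columns were affected.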
Re: DataType protocol ID error for TIMESTAMPs when upgrading from 1.2.11 to 2.0.9
Will do!

On Jul 19, 2014, at 11:22 AM, Robert Stupp sn...@snazy.de wrote:

Can you submit a ticket in C* JIRA at issues.apache.org?

--
Sent from my iPhone

On Jul 19, 2014, at 4:45 PM, Karl Rieb karl.r...@gmail.com wrote:

[...]
Re: DataType protocol ID error for TIMESTAMPs when upgrading from 1.2.11 to 2.0.9
Can now be followed at: https://issues.apache.org/jira/browse/CASSANDRA-7576

On Sat, Jul 19, 2014 at 1:03 PM, Karl Rieb karl.r...@gmail.com wrote:

[...]
Re: DataType protocol ID error for TIMESTAMPs when upgrading from 1.2.11 to 2.0.9
Thanks Ben,

I found that thread, but my concern is the inconsistency on the Cassandra side. Why is the protocol ID correct for some tables but not others? Why does it work when I do a clean install on a new 2.0.x cluster? The bug seems to be on the Cassandra side and the clients seem to just be providing patches to these issues.

I will post to the Datastax java driver mailing list and see if they are willing to add a patch.

-Karl

On Jul 18, 2014, at 3:59 AM, Ben Hood 0x6e6...@gmail.com wrote:

On Fri, Jul 18, 2014 at 3:38 AM, Karl Rieb karl.r...@gmail.com wrote:
> Any suggestions on what is going on or how to fix it?

I'm not sure how much this will help, but one of the gocql users reported similar symptoms when upgrading to 2.0.6. We ended up applying a client-side patch to address the issue; the details are here: https://github.com/gocql/gocql/pull/154

That pull request also references the original bug report: https://github.com/gocql/gocql/issues/151

Not sure how helpful this will be though.
DataType protocol ID error for TIMESTAMPs when upgrading from 1.2.11 to 2.0.9
Hi,

I've been testing an in-place upgrade of a 1.2.11 cluster to 2.0.9. The 1.2.11 nodes all have a schema defined through CQL with existing data before I perform the rolling upgrade. While the upgrade is in progress, services continue to read and write data to the cluster (strictly using protocol version 1). I drain each node one at a time, upgrade the configuration files, upgrade Cassandra, then start the node back up. The Cassandra logs show no errors or exceptions during startup, and the nodes appear to join properly with the rest of the cluster.

On our service side, everything goes smoothly except for queries against a few of our tables. On some of the tables with timestamp columns (not all), we will get an error from the Datastax java-driver when binding PreparedStatements or trying to process ResultSets:

    com.datastax.driver.core.exceptions.InvalidTypeException: Invalid type for value 2 of CQL type 'org.apache.cassandra.db.marshal.DateType', expecting class java.nio.ByteBuffer but class java.util.Date provided
        at com.datastax.driver.core.BoundStatement.bind(BoundStatement.java:190)
        at com.datastax.driver.core.DefaultPreparedStatement.bind(DefaultPreparedStatement.java:103)

I traced the code on the driver side, and I see it has to do with bad DataType information coming back from a table metadata query. The 2.0.9 nodes will return protocol ID 0 instead of 11 for some timestamp column definitions. Protocol ID 0 maps to a custom type, and the 2.0.9 nodes specify org.apache.cassandra.db.marshal.DateType as the custom type name. The 1.2.11 nodes, however, continue to send 11 for their protocol ID, which gets properly mapped to the timestamp data type.

Strangely, not all our tables with timestamp columns have this issue. And if I bring up an entirely new 2.0.9 cluster (no existing data) and provision our schema, there are no issues: the proper protocol ID, 11, gets sent for all our tables with timestamp columns.
I have tried running nodetool upgradesstables and nodetool scrub on the nodes, but neither fixes the issue. Any suggestions on what is going on or how to fix it?
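For what it's worth, the client-side workaround that drivers like gocql shipped (see the pull request linked earlier in the thread) boils down to special-casing the CUSTOM option when the reported custom class is DateType. A hedged Python sketch of that idea: the option IDs follow the CQL binary protocol (0x0000 = Custom, 0x000B = Timestamp), but the parse_data_type helper is hypothetical, not any real driver's API:

```python
# Sketch of a client-side workaround in the spirit of gocql's patch:
# when column metadata reports CUSTOM with the DateType marshal class,
# treat the column as a timestamp.

CUSTOM = 0x0000
TIMESTAMP = 0x000B  # the "protocol ID 11" mentioned above

DATE_TYPE_CLASS = 'org.apache.cassandra.db.marshal.DateType'

def parse_data_type(type_id, custom_class=None):
    """Map a (type_id, custom_class) pair from column metadata to a CQL type name."""
    if type_id == CUSTOM:
        # Workaround for upgraded 2.0.x clusters (CASSANDRA-7576), which
        # report reversed timestamp columns as CUSTOM DateType.
        if custom_class == DATE_TYPE_CLASS:
            return 'timestamp'
        return 'custom(%s)' % custom_class
    if type_id == TIMESTAMP:
        return 'timestamp'
    raise ValueError('unhandled type id: %d' % type_id)

# A 1.2.11 node (ID 11) and an upgraded 2.0.9 node (CUSTOM DateType)
# now decode to the same CQL type:
print(parse_data_type(TIMESTAMP))                # prints timestamp
print(parse_data_type(CUSTOM, DATE_TYPE_CLASS))  # prints timestamp
```

The server-side fix in CASSANDRA-7576 makes this unnecessary, but a mapping like this is how clients stayed functional against affected 2.0.x servers.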