Re: Node forgets about most of its column families
> But the following nodetool repair crashes. It has to be stopped and then re-started.

How did it crash?

> Are there any suggestions for logging or similar so that we can get a clue next time this happens.

Can you make the logs from #5 available? If you feel you can describe the situation, please create a ticket on https://issues.apache.org/jira/browse/CASSANDRA

Cheers

Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/08/2012, at 8:38 AM, Edward Sargisson edward.sargis...@globalrelay.net wrote:

> For the record, we just had a recurrence of this. This time, when the node (#5) came back it didn't properly rejoin the ring. We stopped every node and brought them back one by one to get the ring to link up correctly. Then, all the even nodes (#2, #4, #6) had out-of-date schemas. nodetool resetlocalschema works. But the following nodetool repair crashes. It has to be stopped and then re-started. Are there any suggestions for logging or similar so that we can get a clue next time this happens.
>
> Cheers,
> Edward
>
> On 12-08-24 11:18 AM, Edward Sargisson wrote:
>> Sadly, I don't think we can get much. All I know about the repro is that it was around a node restart. I've just tried that and everything's fine. I see no ERROR level messages in the logs. Clearly, some other conditions are required but we don't know them as yet.
>>
>> Many thanks,
>> Edward
>>
>> On 12-08-24 03:29 AM, aaron morton wrote:
>>> If this is still a test environment can you try to reproduce the fault? Or provide some more details on the sequence of events? If you still have the logs around can you see if any ERROR level messages were logged?
>>>
>>> Cheers
>>>
>>> Aaron Morton
>>> Freelance Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 24/08/2012, at 8:33 AM, Edward Sargisson edward.sargis...@globalrelay.net wrote:
>>>> Ah, yes, I forgot that bit, thanks! 1.1.2 running on CentOS. Running nodetool resetlocalschema then nodetool repair fixed the problem, but not understanding what happened is a concern.
>>>>
>>>> Cheers,
>>>> Edward
>>>>
>>>> On 12-08-23 12:40 PM, Rob Coli wrote:
>>>>> On Thu, Aug 23, 2012 at 11:47 AM, Edward Sargisson edward.sargis...@globalrelay.net wrote:
>>>>>> I was wondering if anybody had seen the following behaviour before and how we might detect it and keep the application running.
>>>>>
>>>>> I don't know the answer to your problem, but anyone who does will want to know in what version of Cassandra you are encountering this issue. :)
>>>>>
>>>>> =Rob
>>>>
>>>> --
>>>> Edward Sargisson
>>>> senior java developer
>>>> Global Relay
>>>> edward.sargis...@globalrelay.net
>>>> 866.484.6630
>>>> New York | Chicago | Vancouver | London (+44.0800.032.9829) | Singapore (+65.3158.1301)
>>>> Global Relay Archive supports email, instant messaging, BlackBerry, Bloomberg, Thomson Reuters, Pivot, YellowJacket, LinkedIn, Twitter, Facebook and more.
>>>> Ask about Global Relay Message - The Future of Collaboration in the Financial Services World
>>>> All email sent to or from this address will be retained by Global Relay’s email archiving system. This message is intended only for the use of the individual or entity to which it is addressed, and may contain information that is privileged, confidential, and exempt from disclosure under applicable law. Global Relay will not be liable for any compliance or technical information provided herein. All trademarks are the property of their respective owners.
Re: Node forgets about most of its column families
Thanks Peter.

This is 1.1.x? Any thoughts on how recent the last schema change was? Had the schema started in a pre-1.1.x cluster? If so, had there been a migration change after the 1.1 upgrade?

Cheers

Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 29/08/2012, at 1:55 PM, Peter Schuller peter.schul...@infidyne.com wrote:

> I can confirm having seen this (no time to debug). One method of recovery is to jump the node back into the ring with auto_bootstrap set to false and an appropriate token set, after deleting system tables. That assumes you're willing to have the node take a few bad reads until you're able to disablegossip and make other nodes not send requests to it. Disabling thrift would also be advised, or even firewalling it prior to restart.
>
> --
> / Peter Schuller (@scode, http://worldmodscode.wordpress.com)
Re: Node forgets about most of its column families
Hi Aaron,

Thanks for the reply.

I've recorded what we know at https://issues.apache.org/jira/browse/CASSANDRA-4583. This includes log snippets from two of the nodes from around the time. I don't know what is relevant, so they've got everything that was in the system log at the time of the failure and recovery.

Nodetool "crashed" by not returning: nothing appeared in the logs, and nodetool compactionstats and nodetool netstats both indicated that nothing was happening.

Thanks for your time looking at this.

Cheers,
Edward
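A repair that simply never returns, as described in this message, is easier to spot if the command is run under a time limit. The following is a minimal sketch, assuming Python 3.7+ is available on the admin host; `run_with_timeout` is a hypothetical helper, not part of Cassandra or nodetool, and the time limit is something you would have to choose yourself.

```python
import subprocess


def run_with_timeout(command, timeout_seconds):
    """Run a command and report whether it finished within the time limit.

    Returns (finished, exit_code, output). On timeout the child process is
    killed by subprocess.run and (False, None, "") is returned.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr into the captured output
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return False, None, ""
    return True, result.returncode, result.stdout
```

It could be called as, for example, `run_with_timeout(["nodetool", "-h", "localhost", "repair"], 3600)`. A timeout here only tells you that the client did not return in time; it says nothing about what the server was doing, so the system log is still needed for that.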
Re: Node forgets about most of its column families
For those playing along at home, Edward's ticket was marked as a dup of "Problem with creating keyspace after drop": https://issues.apache.org/jira/browse/CASSANDRA-4219

Cheers

Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
Re: Node forgets about most of its column families
For the record, we just had a recurrence of this.

This time, when the node (#5) came back it didn't properly rejoin the ring. We stopped every node and brought them back one by one to get the ring to link up correctly.

Then, all the even nodes (#2, #4, #6) had out-of-date schemas. nodetool resetlocalschema works. But the following nodetool repair crashes. It has to be stopped and then re-started.

Are there any suggestions for logging or similar so that we can get a clue next time this happens?

Cheers,
Edward
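One way to notice nodes holding different schemas is to group nodes by the schema version they report. The sketch below is only an illustration: it assumes input lines shaped like `<schema-version-uuid>: [<node>, <node>]`, modelled on the "Schema versions" listing that `describe cluster` prints in cassandra-cli, and the exact output on your version should be checked before relying on it. `schema_versions` and `schema_agrees` are hypothetical helper names.

```python
import re

# Assumed line shape: "<schema-version>: [<node>, <node>, ...]"
VERSION_LINE = re.compile(r"^\s*([0-9a-fA-F-]{36}|UNREACHABLE):\s*\[(.*)\]\s*$")


def schema_versions(text):
    """Map each schema version (or 'UNREACHABLE') to the nodes listed for it."""
    versions = {}
    for line in text.splitlines():
        match = VERSION_LINE.match(line)
        if match:
            nodes = [n.strip() for n in match.group(2).split(",") if n.strip()]
            versions.setdefault(match.group(1), []).extend(nodes)
    return versions


def schema_agrees(versions):
    """True when the reachable nodes report exactly one schema version."""
    live = [version for version in versions if version != "UNREACHABLE"]
    return len(live) == 1
```

A monitoring job could alert when `schema_agrees` stays false for more than a short period, since brief disagreement right after a schema change is normal.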
Re: Node forgets about most of its column families
I can confirm having seen this (no time to debug). One method of recovery is to jump the node back into the ring with auto_bootstrap set to false and an appropriate token set, after deleting system tables. That assumes you're willing to have the node take a few bad reads until you're able to disablegossip and make other nodes not send requests to it. Disabling thrift would also be advised, or even firewalling it prior to restart.

--
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
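The configuration part of the recovery described above (auto_bootstrap set to false plus an explicit token) could be sketched as a small edit to the text of cassandra.yaml. This is only an illustration under assumptions: `set_yaml_option` is a hypothetical helper, the token value `0` is a placeholder for the node's real token, and the file is assumed to use flat top-level `key: value` lines. It does not delete system tables or touch gossip or thrift.

```python
import re


def set_yaml_option(yaml_text, key, value):
    """Set a top-level `key: value` line in YAML text.

    Replaces the first existing line for the key (including a commented-out
    one such as "# key: ..."); appends a new line if the key is absent.
    """
    pattern = re.compile(
        r"^(?:#[ \t]*)?" + re.escape(key) + r"[ \t]*:.*$", re.MULTILINE
    )
    line = "{}: {}".format(key, value)
    if pattern.search(yaml_text):
        # A function replacement avoids backslash handling in the value.
        return pattern.sub(lambda _match: line, yaml_text, count=1)
    return yaml_text.rstrip("\n") + "\n" + line + "\n"


config = "cluster_name: 'Test Cluster'\n# initial_token:\n"
config = set_yaml_option(config, "auto_bootstrap", "false")
config = set_yaml_option(config, "initial_token", "0")  # placeholder token
```

Editing the text line by line keeps the comments in the file intact, which a full YAML load and dump would normally discard.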
Re: Node forgets about most of its column families
If this is still a test environment can you try to reproduce the fault? Or provide some more details on the sequence of events?

If you still have the logs around can you see if any ERROR level messages were logged?

Cheers

Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com
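Checking the logs for ERROR level messages can be scripted. The following is a minimal sketch, assuming log4j-style lines that start with the level followed by the thread name in square brackets (for example `ERROR [Thread-1] 2012-08-23 ...`); that layout and the helper name `error_entries` are assumptions, so adjust the pattern to whatever your system log actually looks like.

```python
import re

# Assumed layout of the first line of each entry: "LEVEL [thread] ..."
ENTRY_START = re.compile(r"^\s*(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\s+\[")


def error_entries(log_text):
    """Return ERROR/FATAL entries, each joined with its continuation lines.

    Lines that do not start a new entry (such as stack trace lines) are
    attached to the ERROR/FATAL entry that precedes them.
    """
    entries = []
    current = None  # lines of the entry being collected, or None
    for line in log_text.splitlines():
        match = ENTRY_START.match(line)
        if match:
            if current is not None:
                entries.append("\n".join(current))
            current = [line] if match.group(1) in ("ERROR", "FATAL") else None
        elif current is not None:
            current.append(line)
    if current is not None:
        entries.append("\n".join(current))
    return entries
```

Keeping the continuation lines matters here because the stack trace, not the first line, usually says where the failure came from.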
Re: Node forgets about most of its column families
Sadly, I don't think we can get much. All I know about the repro is that it was around a node restart. I've just tried that and everything's fine. I see no ERROR level messages in the logs.

Clearly, some other conditions are required but we don't know them as yet.

Many thanks,
Edward
Re: Node forgets about most of its column families
On Thu, Aug 23, 2012 at 11:47 AM, Edward Sargisson edward.sargis...@globalrelay.net wrote:

> I was wondering if anybody had seen the following behaviour before and how we might detect it and keep the application running.

I don't know the answer to your problem, but anyone who does will want to know in what version of Cassandra you are encountering this issue. :)

=Rob

--
=Robert Coli
AIMGTALK - rc...@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb
Re: Node forgets about most of its column families
Ah, yes, I forgot that bit, thanks! 1.1.2 running on CentOS.

Running nodetool resetlocalschema then nodetool repair fixed the problem, but not understanding what happened is a concern.

Cheers,
Edward