[jira] [Commented] (CASSANDRA-11742) Failed bootstrap results in exception when node is restarted

2016-05-21 Thread Tommy Stendahl (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15294889#comment-15294889
 ] 

Tommy Stendahl commented on CASSANDRA-11742:


No, I have no objection. Your approach seems to be the best one.

> Failed bootstrap results in exception when node is restarted
> 
>
> Key: CASSANDRA-11742
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11742
> Project: Cassandra
> Issue Type: Bug
> Reporter: Tommy Stendahl
> Assignee: Tommy Stendahl
> Priority: Minor
> Fix For: 2.2.x, 3.0.x, 3.x
>
> Attachments: 11742-2.txt, 11742.txt
>
>
> Since 2.2, a failed bootstrap results in a
> {{org.apache.cassandra.exceptions.ConfigurationException: Found system
> keyspace files, but they couldn't be loaded!}} exception when the node is
> restarted. This did not happen in 2.1, which just tried to bootstrap again. I
> know that the workaround is relatively easy (delete the system keyspace
> in the data folder on disk and try again), but it's a bit annoying that you
> have to do that.
> The problem seems to be that the creation of the {{system.local}} table has
> been moved to just before the bootstrap begins (in 2.1 it was done much
> earlier), and as a result it's still in the memtable and commitlog if the
> bootstrap fails. Still, a few values are inserted into the {{system.local}}
> table at an earlier point in the startup, and they have been flushed from the
> memtable to an sstable. When the node is restarted,
> {{SystemKeyspace.checkHealth()}} is executed before the commitlog is replayed,
> so it sees only the sstable with an incomplete {{system.local}} table
> and throws an exception.
> I think we could fix this very easily by force-flushing the system keyspace in
> the {{StorageServiceShutdownHook}}. I have included a patch that does this.
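
The failure mode in the description can be illustrated with a toy model. This is
a sketch, not Cassandra code: ToyLocalTable and all of its methods are
hypothetical names, and the health check is reduced to the single rule discussed
in this thread (flushed data exists, but the cluster name is not part of it). It
only shows why a value that lives in the memtable and commitlog is invisible to
a check that runs before commitlog replay.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: none of these names are Cassandra classes or methods.
final class ToyLocalTable {
    private final Map<String, String> sstable = new HashMap<>();   // on disk, visible to the check
    private final Map<String, String> memtable = new HashMap<>();  // in memory, lost on crash
    private final Map<String, String> commitlog = new HashMap<>(); // on disk, replayed after the check

    void write(String key, String value) {
        memtable.put(key, value);
        commitlog.put(key, value);
    }

    void flush() {
        sstable.putAll(memtable);
        memtable.clear();
    }

    void crash() {
        memtable.clear();
    }

    void replayCommitlog() {
        memtable.putAll(commitlog);
    }

    // Simplified rule: files on disk without a cluster name count as unloadable.
    boolean healthCheckPasses() {
        return sstable.isEmpty() || sstable.containsKey("cluster_name");
    }

    // An early value is flushed, the cluster name is not, then bootstrap fails.
    static boolean restartAfterFailedBootstrap() {
        ToyLocalTable local = new ToyLocalTable();
        local.write("early_value", "x");
        local.flush();
        local.write("cluster_name", "Test Cluster");
        local.crash();
        return local.healthCheckPasses(); // runs before any commitlog replay
    }

    public static void main(String[] args) {
        System.out.println("health check passes on restart: " + restartAfterFailedBootstrap());
    }
}
```

In this model the restart check fails even though the commitlog still holds the
cluster name, because the check only looks at flushed data.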



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-11742) Failed bootstrap results in exception when node is restarted

2016-05-20 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293873#comment-15293873
 ] 

Joel Knighton commented on CASSANDRA-11742:
---

[~tommy_s] Any reason you're opposed to switching to the approach I described 
above? It would guarantee that a crash during start up wouldn't leave the 
system keyspace in a failing health check state, as opposed to only shrinking 
the window. If that's fine with you, I can switch myself to assignee and push 
that branch for CI.



[jira] [Commented] (CASSANDRA-11742) Failed bootstrap results in exception when node is restarted

2016-05-20 Thread Sam Tunnicliffe (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15293104#comment-15293104
 ] 

Sam Tunnicliffe commented on CASSANDRA-11742:
-

bq.  Can you think of any reason we can't just call 
SystemKeyspace.persistLocalMetadata immediately after snapshotting the system 
keyspace in CassandraDaemon?
That sounds entirely reasonable to me.



[jira] [Commented] (CASSANDRA-11742) Failed bootstrap results in exception when node is restarted

2016-05-19 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15292309#comment-15292309
 ] 

Joel Knighton commented on CASSANDRA-11742:
---

I think this second patch is an improvement - I traced this issue to determine 
exactly why it worked on 2.1. This behavior was introduced by [CASSANDRA-8049] 
which centralized Cassandra startup checks. Prior to this change, we inserted 
cluster name directly after checking the health of the system keyspace, so if 
an sstable for the system keyspace was flushed, we could guarantee that some 
sstable contained cluster name. After [CASSANDRA-8049], we insert cluster name 
with the rest of the local metadata in {{SystemKeyspace.finishStartup}}.

[~beobal] - I couldn't find a reason for the change as to when cluster name is 
inserted other than that it didn't seem like a good idea to mutate anything in 
a startup check. Can you think of any reason we can't just call 
{{SystemKeyspace.persistLocalMetadata}} immediately after snapshotting the 
system keyspace in {{CassandraDaemon}}? The root cause of this problem is that 
we need the data persisted before any truncate/schema logic, since these will 
write to the system keyspace, so we can have flushed sstables with this data 
but no sstable with cluster name, which breaks the logic of the system keyspace 
health check. I ran full unit tests/dtests on a branch that moved 
{{SystemKeyspace.persistLocalMetadata}} to immediately after the snapshot of 
the system keyspace and the results looked good.
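
The ordering argument above can be checked with a small simulation. This is a
sketch under simplifying assumptions, not the actual startup code: the step
names are illustrative, only a single table is modelled, and the health check is
reduced to "no flushed data, or flushed data that includes the cluster name".
The method crashes the toy node after every prefix of the startup sequence and
reports the crash points that would fail the check on restart.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: step names are illustrative, not Cassandra APIs.
final class StartupOrderSketch {

    // A step is either "flush" or the name of a key written to the memtable.
    static List<Integer> failingCrashPoints(List<String> steps) {
        List<Integer> failing = new ArrayList<>();
        for (int crashAfter = 0; crashAfter <= steps.size(); crashAfter++) {
            Set<String> memtable = new HashSet<>();
            Set<String> sstable = new HashSet<>();
            for (String step : steps.subList(0, crashAfter)) {
                if (step.equals("flush")) {
                    sstable.addAll(memtable);
                    memtable.clear();
                } else {
                    memtable.add(step);
                }
            }
            // On restart only flushed data is visible to the check.
            boolean healthy = sstable.isEmpty() || sstable.contains("cluster_name");
            if (!healthy) {
                failing.add(crashAfter);
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        List<String> clusterNameLast = Arrays.asList("truncation_record", "flush", "cluster_name");
        List<String> clusterNameFirst = Arrays.asList("cluster_name", "truncation_record", "flush");
        System.out.println("cluster name last:  " + failingCrashPoints(clusterNameLast));
        System.out.println("cluster name first: " + failingCrashPoints(clusterNameFirst));
    }
}
```

With the cluster name written last, the toy node fails the check if it crashes
after step 2 or step 3. With the cluster name written first, every flush carries
it along and no crash point fails.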



[jira] [Commented] (CASSANDRA-11742) Failed bootstrap results in exception when node is restarted

2016-05-17 Thread Tommy Stendahl (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15286633#comment-15286633
 ] 

Tommy Stendahl commented on CASSANDRA-11742:


I looked into this again and did some tests, and a blocking flush in 
{{persistLocalMetadata}} seems to be a better way to do it. I have made a new 
patch for that.



[jira] [Commented] (CASSANDRA-11742) Failed bootstrap results in exception when node is restarted

2016-05-16 Thread Tommy Stendahl (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15284249#comment-15284249
 ] 

Tommy Stendahl commented on CASSANDRA-11742:


I see what you mean, that should close the remaining window. Even if that window 
is small, it would be good if we can avoid it. I will do some tests with a 
blocking flush in {{persistLocalMetadata}}; if it works out, I will post a new 
patch.

> Failed bootstrap results in exception when node is restarted
> 
>
> Key: CASSANDRA-11742
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11742
> Project: Cassandra
> Issue Type: Bug
> Reporter: Tommy Stendahl
> Assignee: Tommy Stendahl
> Priority: Minor
> Attachments: 11742.txt


[jira] [Commented] (CASSANDRA-11742) Failed bootstrap results in exception when node is restarted

2016-05-13 Thread Joel Knighton (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-11742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15283140#comment-15283140
 ] 

Joel Knighton commented on CASSANDRA-11742:
---

I've confirmed this issue - there's a window between 
{{SystemKeyspace.finishStartup()}} and calling {{Gossiper.instance.start()}} in 
{{prepareToJoin}} where the contents will only be in the memtable/commitlog and 
not flushed.

There doesn't seem to be a clearly better way to refactor 
{{SystemKeyspace.checkHealth()}} - ideally, we wouldn't write to {{local}} 
before {{finishStartup}}, but that would be a significant refactor without a 
very significant reward. Even with your proposed fix, there's a window where we 
could crash before we even attempt to write in 
{{SystemKeyspace.finishStartup()}}, but I think that's livable.

As an alternative to your patch, [~tommy_s], how would you feel about just 
forcing a blocking flush in {{persistLocalMetadata}}? This would ensure the 
data is present even if we have a hard crash/kill circumstance where 
{{StorageServiceShutdownHook}} doesn't run.
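
The difference between the two options can be sketched as follows. This is a toy
comparison under stated assumptions, not the patch itself: the enum and method
names are hypothetical, and the health check is again reduced to "flushed data
must include the cluster name". It relies on one general JVM fact: shutdown
hooks are not run when the process is killed hard (for example with SIGKILL or
via Runtime.halt).

```java
import java.util.HashSet;
import java.util.Set;

// Toy model only: names are hypothetical and do not match Cassandra's code.
final class FlushStrategySketch {
    enum Strategy { FLUSH_IN_SHUTDOWN_HOOK, BLOCKING_FLUSH_ON_PERSIST }
    enum Exit { GRACEFUL, HARD_KILL }

    static boolean restartIsHealthy(Strategy strategy, Exit exit) {
        Set<String> memtable = new HashSet<>();
        Set<String> sstable = new HashSet<>();

        // An earlier startup write that has already reached an sstable.
        sstable.add("early_value");

        // Local metadata is persisted; only one strategy flushes right away.
        memtable.add("cluster_name");
        if (strategy == Strategy.BLOCKING_FLUSH_ON_PERSIST) {
            sstable.addAll(memtable);
            memtable.clear();
        }

        // Bootstrap fails and the process exits. A shutdown hook only gets a
        // chance to flush on a graceful exit.
        if (strategy == Strategy.FLUSH_IN_SHUTDOWN_HOOK && exit == Exit.GRACEFUL) {
            sstable.addAll(memtable);
            memtable.clear();
        }

        // On restart the memtable is gone and the check sees flushed data only.
        return sstable.contains("cluster_name");
    }

    public static void main(String[] args) {
        for (Strategy strategy : Strategy.values()) {
            for (Exit exit : Exit.values()) {
                System.out.println(strategy + " / " + exit + " -> healthy: "
                        + restartIsHealthy(strategy, exit));
            }
        }
    }
}
```

In this model only the combination of a hook-based flush and a hard kill leaves
the node unhealthy on restart, which is the case the blocking flush is meant to
cover.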

> Failed bootstrap results in exception when node is restarted
> 
>
> Key: CASSANDRA-11742
> URL: https://issues.apache.org/jira/browse/CASSANDRA-11742
> Project: Cassandra
> Issue Type: Bug
> Reporter: Tommy Stendahl
> Assignee: Tommy Stendahl
> Priority: Minor
> Attachments: 11742.txt