[ 
https://issues.apache.org/jira/browse/CASSANDRA-14957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738962#comment-16738962
 ] 

maxwellguo commented on CASSANDRA-14957:
----------------------------------------

As far as I know, when you modify the schema (you mention a new version for it; did you first drop and then recreate the table?), the schema change is first written locally and then sent to the other live nodes; those nodes record the migration locally as well and announce it through gossip.

To restate your report: you restarted the cluster (without upgrading the Cassandra version), a new schema version was created for the table, and writes kept succeeding, yet different versions of the same table ended up on the same node; the writes succeeded because a table by that name already existed. Is that description correct? If so, can you add more information, such as jstack output from the affected node? It would also help to confirm whether the nodes agree on the schema version, as in the sketch below.
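
A minimal way to check schema agreement from the command line, using standard nodetool and cqlsh invocations (a healthy cluster reports exactly one schema version ID):

```
# List the schema version(s) known across the cluster; after full
# propagation, exactly one version ID appears under "Schema versions".
nodetool describecluster

# The same information can be read from the system tables:
cqlsh -e "SELECT schema_version FROM system.local;"
cqlsh -e "SELECT peer, schema_version FROM system.peers;"
```

If more than one version ID is still listed after the rolling restart, the nodes have diverged on schema, and the UnknownColumnFamilyException quoted below is expected until they converge.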

> Rolling Restart Of Nodes Causes Data Loss Due To Schema Collision
> -----------------------------------------------------------------
>
>                 Key: CASSANDRA-14957
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-14957
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Cluster/Schema
>            Reporter: Avraham Kalvo
>            Priority: Major
>
> We were issuing a rolling restart on a mission-critical five-node C* cluster.
> The first node to be restarted logged the following messages in its 
> system.log:
> ```
> January 2nd 2019, 12:06:37.310 - INFO 12:06:35 Initializing tasks_scheduler_external.tasks
> ```
> ```
> WARN 12:06:39 UnknownColumnFamilyException reading from socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId bd7200a0-1567-11e8-8974-855d74ee356f. If a table was just created, this is likely due to the schema not being fully propagated. Please wait for schema agreement on table creation.
> at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1336) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize30(PartitionUpdate.java:660) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.db.partitions.PartitionUpdate$PartitionUpdateSerializer.deserialize(PartitionUpdate.java:635) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:330) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:349) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:286) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.net.MessageIn.read(MessageIn.java:98) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:201) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:178) ~[apache-cassandra-3.0.10.jar:3.0.10]
> at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:92) ~[apache-cassandra-3.0.10.jar:3.0.10]
> ```
> The latter was then repeated several times across the cluster.
> It was then found that the table in question, 
> `tasks_scheduler_external.tasks`, had been created with a new schema version 
> at some point during the consecutive restarts across the cluster, and became 
> available once schema agreement settled. The new version started taking 
> requests, leaving the previous version of the schema unavailable for any 
> request and thus causing data loss in our online system.
> The data was recovered by manually copying the SSTables from the previous 
> schema version's directory to the new one, followed by `nodetool refresh` 
> on the relevant table.
> The above has repeated itself for several tables across various keyspaces.
> One other thing to mention: a repair was running on the first node to be 
> restarted and was naturally stopped when the daemon shut down, but at first 
> glance this doesn't seem related to the above.
> Seems somewhat related to:
> https://issues.apache.org/jira/browse/CASSANDRA-13559
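
The recovery described above (copying SSTables between the two table-ID directories, then `nodetool refresh`) would look roughly like the following sketch; it assumes the default data directory layout, and OLD_ID/NEW_ID are placeholders for the actual table-ID directory suffixes:

```
# OLD_ID and NEW_ID are placeholders for the table-ID directory suffixes
# created by the two colliding schema versions.
cp /var/lib/cassandra/data/tasks_scheduler_external/tasks-OLD_ID/* \
   /var/lib/cassandra/data/tasks_scheduler_external/tasks-NEW_ID/

# Load the copied SSTables into the live table without a restart:
nodetool refresh tasks_scheduler_external tasks
```

Care would be needed so that copied SSTable generation numbers do not collide with files already present in the target directory.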



