[jira] [Updated] (CASSANDRA-16047) Potential race condition in creating hard link when incremental backup is turned on

Wei Deng (Jira) Thu, 13 Aug 2020 09:09:20 -0700


     [ https://issues.apache.org/jira/browse/CASSANDRA-16047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Wei Deng updated CASSANDRA-16047:
---------------------------------
    Description: 
It seems that there is a race condition in creating the hard link when 
incremental backup is turned on.

The following screenshot was captured in a production cluster running Cassandra 
3.0.15 after turning on incremental backup. When this {{NoSuchFileException}} 
happens, the resulting {{FSWriteError}} and the default disk failure policy 
cause the JVM to be shut down, so it's a pretty critical bug.
 !incremental_backup_hardlink_exception.jpg!

Because of the risk of production database downtime (if a similar issue 
happens on multiple nodes in a short time frame), and because the same 
exception has already caused JVM shutdown multiple times, incremental backup 
had to be turned off for now, but this is not an ideal situation.

!incremental_backup_hardlink_exception1.jpg!

The deployment is in a public cloud environment with EBS-like disks that are 
backed by SSD with decent latency, throughput and IOPS, so it is hard to 
believe the culprit is in the OS or IO layer. Based on the second screenshot 
above, the affected table is {{system.size_estimates}}, which has low flush 
traffic, so compaction of the source SSTable doesn't seem to be at play here.

  was:
It seems that there is a race condition in creating the hard link when 
incremental backup is turned on.

The following screenshot was captured in a production cluster running Cassandra 
3.0.15 after turning on incremental backup. When this {{NoSuchFileException}} 
happens, the resulting {{FSWriteError}} and the default disk failure policy 
cause the JVM to be shut down, so it's a pretty critical bug.
 !incremental_backup_hardlink_exception.jpg!

Because of the risk of production database downtime (if a similar issue 
happens on multiple nodes in a short time frame), and because a similar issue 
has already caused JVM shutdown multiple times, incremental backup had to be 
turned off for now, but this is not an ideal situation.

!incremental_backup_hardlink_exception1.jpg!

The deployment is in a public cloud environment with EBS-like disks that are 
backed by SSD with decent latency, throughput and IOPS, so it is hard to 
believe the culprit is in the OS or IO layer. Based on the second screenshot 
above, the affected table is {{system.size_estimates}}, which has low flush 
traffic, so compaction of the source SSTable doesn't seem to be at play here.


> Potential race condition in creating hard link when incremental backup is 
> turned on
> -----------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16047
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16047
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/SSTable
>            Reporter: Wei Deng
>            Priority: Urgent
>         Attachments: incremental_backup_hardlink_exception.jpg, 
> incremental_backup_hardlink_exception1.jpg
>
>
> It seems that there is a race condition in creating the hard link when 
> incremental backup is turned on.
> The following screenshot was captured in a production cluster running 
> Cassandra 3.0.15 after turning on incremental backup. When this 
> {{NoSuchFileException}} happens, the resulting {{FSWriteError}} and the 
> default disk failure policy cause the JVM to be shut down, so it's a pretty 
> critical bug.
>  !incremental_backup_hardlink_exception.jpg!
> Because of the risk of production database downtime (if a similar issue 
> happens on multiple nodes in a short time frame), and because the same 
> exception has already caused JVM shutdown multiple times, incremental backup 
> had to be turned off for now, but this is not an ideal situation.
> !incremental_backup_hardlink_exception1.jpg!
> The deployment is in a public cloud environment with EBS-like disks that are 
> backed by SSD with decent latency, throughput and IOPS, so it is hard to 
> believe the culprit is in the OS or IO layer. Based on the second 
> screenshot above, the affected table is {{system.size_estimates}}, which has 
> low flush traffic, so compaction of the source SSTable doesn't seem to be at 
> play here.
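
For readers unfamiliar with the failure mode described above, here is a 
minimal sketch of how a hard-link call can end in {{NoSuchFileException}} when 
the source file disappears before the link is created. It is not Cassandra's 
actual code path: the class name {{HardLinkRaceSketch}}, the helper 
{{tryHardLink}} and the file names are hypothetical, and the "race" is 
simulated by deleting the source file explicitly. Only the standard 
{{java.nio.file.Files.createLink}} call is used.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

public class HardLinkRaceSketch {

    /**
     * Hypothetical helper: tries to hard-link source into link.
     * Returns true if the link was created, false if the source file
     * no longer existed when the link was attempted.
     */
    static boolean tryHardLink(Path source, Path link) throws IOException {
        try {
            // Files.createLink(link, existing): the new link path comes first.
            Files.createLink(link, source);
            return true;
        } catch (NoSuchFileException e) {
            // The source vanished between being chosen and being linked.
            // A caller that rethrows this as a fatal write error would
            // trigger whatever disk failure policy is configured.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-ins for a table directory and its backups subdirectory.
        Path dir = Files.createTempDirectory("hardlink-sketch");
        Path backups = Files.createDirectory(dir.resolve("backups"));
        Path sstable = Files.createFile(dir.resolve("example-Data.db"));

        // Normal case: the source exists when the link is created.
        boolean linked = tryHardLink(sstable, backups.resolve("example-Data.db"));

        // Simulated race: another actor removes the source first.
        Files.delete(sstable);
        boolean linkedAfterDelete =
                tryHardLink(sstable, backups.resolve("example-Data-2.db"));

        System.out.println("linked while source present: " + linked);
        System.out.println("linked after source removed: " + linkedAfterDelete);
    }
}
```

The sketch only shows that the exception itself is consistent with the source 
file being removed between selection and linking; it says nothing about which 
component removes the file in the reported cluster.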



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

