[ https://issues.apache.org/jira/browse/IGNITE-8676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Aleksandrov updated IGNITE-8676:
---------------------------------------
    Description: 
Steps to reproduce:

1) Start 3 data nodes (DN1, DN2, DN3) with a configuration that contains a cache with a node filter for only these three nodes and 1 backup (see the configuration in the attachment; a hedged Java sketch follows the screenshot below).
2) Activate the cluster. Now you should have 3 nodes in the baseline topology (BLT).
3) Start a new server node (SN). Now you should have 3 nodes in the BLT and 1 node outside the baseline.
4) Using any node, load about 10000 (or more) entries into the cache.
5) Start Visor and check that the number of primary partitions equals the number of backup partitions.

!image-2018-06-01-12-34-54-320.png!
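For reference, here is a minimal Java sketch of the node setup from steps 1-4. The cache name (testCache), the node attribute used by the filter (data.node), and the loop-based loading are assumptions for illustration; the actual settings are in the attached configuration.

{code:java}
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class DataNodeStartup {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Note 2 below: the consistent ID must stay the same across restarts.
        cfg.setConsistentId("DN1"); // DN1 / DN2 / DN3 / SN, depending on the node

        // Mark the data nodes so the node filter can recognize them
        // ("data.node" is an assumed attribute name; SN starts without it).
        cfg.setUserAttributes(Collections.singletonMap("data.node", true));

        // Native persistence is needed to observe the loss after reactivation.
        DataStorageConfiguration storageCfg = new DataStorageConfiguration();
        storageCfg.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
        cfg.setDataStorageConfiguration(storageCfg);

        // Cache with 1 backup, held only on DN1/DN2/DN3 (step 1).
        CacheConfiguration<Integer, Integer> cacheCfg = new CacheConfiguration<>("testCache");
        cacheCfg.setBackups(1);
        cacheCfg.setNodeFilter(node -> Boolean.TRUE.equals(node.attribute("data.node")));
        cfg.setCacheConfiguration(cacheCfg);

        Ignite ignite = Ignition.start(cfg);

        // Steps 2 and 4: activate once all nodes are up, then load ~10000 entries.
        ignite.cluster().active(true);
        IgniteCache<Integer, Integer> cache = ignite.cache("testCache");
        for (int i = 0; i < 10_000; i++)
            cache.put(i, i);
    }
}
{code}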
6) Now stop DN3 and SN. After that, start them again at the same time.
7) When DN3 and SN are online again, check whether the number of primary partitions (PN) equals the number of backup partitions (BN), for example with the sketch after the next screenshot.

7.1) If PN == BN => go to step 6)
7.2) If PN != BN => go to step 8)

!image-2018-06-01-13-12-47-218.png!
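The PN/BN check from steps 5 and 7 can also be done programmatically via the Affinity API instead of Visor; a sketch (the cache name is again an assumption):

{code:java}
import org.apache.ignite.Ignite;
import org.apache.ignite.cache.affinity.Affinity;
import org.apache.ignite.cluster.ClusterNode;

public class PartitionCheck {
    /** Sums primary and backup partitions over all server nodes and compares them. */
    static boolean primariesEqualBackups(Ignite ignite) {
        Affinity<Object> aff = ignite.affinity("testCache");

        int pn = 0;
        int bn = 0;

        for (ClusterNode node : ignite.cluster().forServers().nodes()) {
            pn += aff.primaryPartitions(node).length;
            bn += aff.backupPartitions(node).length;
        }

        System.out.println("PN=" + pn + ", BN=" + bn);

        return pn == bn; // false => go to step 8
    }
}
{code}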

8) Deactivate the cluster with control.sh.
9) Activate the cluster with control.sh (a programmatic equivalent is sketched after the screenshot below).

Now you should see the data loss.

!image-2018-06-01-13-15-17-437.png!
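For completeness, a sketch of the deactivate/activate cycle from steps 8-9 driven from Java; it should be equivalent in effect to the control.sh calls:

{code:java}
import org.apache.ignite.Ignite;

public class Reactivate {
    /** Mirrors steps 8-9: deactivate the cluster, then activate it again. */
    static void reactivate(Ignite ignite) {
        ignite.cluster().active(false); // step 8: deactivate
        ignite.cluster().active(true);  // step 9: activate; data loss is visible after this
    }
}
{code}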

Notes:
1) Stops/starts should be done at the same time.
2) Consistent IDs for the nodes should stay constant across restarts.

Also, I have provided a reproducer that can often (though not always) reproduce this issue. Clear the working directory and restart the reproducer if a given iteration shows no data loss.

> Possible data loss after stopping/starting several nodes at the same time
> ------------------------------------------------------------------------
>
>                 Key: IGNITE-8676
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8676
>             Project: Ignite
>          Issue Type: Bug
>          Components: persistence
>    Affects Versions: 2.4
>            Reporter: Andrey Aleksandrov
>            Priority: Critical
>             Fix For: 2.6
>
>         Attachments: DataLossTest.zip, image-2018-06-01-12-34-54-320.png, 
> image-2018-06-01-13-12-47-218.png, image-2018-06-01-13-15-17-437.png
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
