[jira] [Commented] (ZOOKEEPER-3333) Detect if txnlogs and / or snapshots is deleted under a running ZK instance

2019-03-28 Thread Fangmin Lv (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16804121#comment-16804121
 ] 

Fangmin Lv commented on ZOOKEEPER-:
---

[~nkalmar] internally, we're using the real-time consistency check that is 
being upstreamed in ZOOKEEPER-3114 to detect txns or data that have been 
partially deleted from disk. 

If all the data is deleted, it should be fine as long as traffic keeps flowing 
and we write another snapshot and txns. But we may not detect this in time if 
the snapshot txn threshold is large.

Internally, we also take periodic backups to avoid this kind of disaster in 
case we lose all data due to a bad operation or some other failure.

> Detect if txnlogs and / or snapshots is deleted under a running ZK instance
> ---
>
> Key: ZOOKEEPER-
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-
> Project: ZooKeeper
> Issue Type: Improvement
> Components: server
> Affects Versions: 3.5.5, 3.4.14
> Reporter: Norbert Kalmar
> Priority: Major
>
> ZK does not notice if txnlogs are deleted from its dataDir, and it will just 
> keep running, writing txns in the buffer. Then, when ZK is restarted, it will 
> lose all data.
> To reproduce:
> I ran a 3 node ZK ensemble, deleted dataDir for just one instance, then 
> wrote some data. It turns out it will not write the transaction to disk. ZK 
> stores everything in memory until it “feels like” it’s time to persist it on 
> disk. So it doesn’t even notice the file is deleted, and when it tries to 
> flush, I imagine it just fails and keeps it in the buffer. 
> So anyway, I restarted the instance, and it got the snapshot + latest txn 
> logs from the other nodes, as expected. It also wrote them in dataDir, so 
> now every node had the dataDir.
> So deleting from one node is fine (again, as expected, they will sync after a 
> restart).
> Then, I deleted the dataDir of all 3 nodes under running instances. Until 
> restart, it worked fine (of course my buffer was filling up; I did not test 
> up to the point where it overflowed).
> But after restart, I got a fresh new ZK with all my znodes gone.
> For starters, I think ZK should detect if the file it is appending to is 
> removed. What should ZK do? At least give a warning log message. The question 
> is, should it try to create a new file? Or try to get it from other nodes? Or 
> just fail instantly? Restart itself, see if it can sync?
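
A minimal sketch of the detection idea in the description above, assuming a POSIX-like filesystem where writes to an already-open file keep succeeding after the file is unlinked (which may be why the server does not notice). `CheckedAppender` and its methods are hypothetical names for illustration, not ZooKeeper classes; the check is a plain path-existence test before each append.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical append-only writer that checks its file still exists before each append. */
final class CheckedAppender implements AutoCloseable {
    private final Path path;
    private final FileChannel channel;

    CheckedAppender(Path path) {
        this.path = path;
        try {
            this.channel = FileChannel.open(path, StandardOpenOption.CREATE,
                    StandardOpenOption.WRITE, StandardOpenOption.APPEND);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true if the path no longer points at a file on disk. */
    boolean fileMissing() {
        return Files.notExists(path);
    }

    /**
     * Appends one record through the open channel.
     * Returns false (after logging a warning) if the file had been removed.
     */
    boolean append(String record) {
        boolean present = !fileMissing();
        if (!present) {
            System.err.println("WARN: " + path + " was removed while still open for append");
        }
        try {
            ByteBuffer buf = ByteBuffer.wrap(record.getBytes(StandardCharsets.UTF_8));
            while (buf.hasRemaining()) {
                channel.write(buf);
            }
            channel.force(false); // flush data to the (possibly unlinked) file
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return present;
    }

    @Override
    public void close() {
        try {
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A path check like this would not notice a file that was deleted and then recreated under the same name; comparing `BasicFileAttributes.fileKey()` values might cover that case on filesystems that provide one.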



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ZOOKEEPER-3333) Detect if txnlogs and / or snapshots is deleted under a running ZK instance

2019-03-27 Thread Norbert Kalmar (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16802980#comment-16802980
 ] 

Norbert Kalmar commented on ZOOKEEPER-:
---

Thanks for your input, Brian!

I would say ZK should then log a warning, take a snapshot, and start a new txn 
log, while making sure we do not lose any txns along the way.
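
The reaction proposed here (warn, snapshot, new txn log) could be sketched roughly as follows. Everything in this block is hypothetical: `TxnLogLossHandler` and the `Persistence` interface are stand-ins invented for illustration and are not ZooKeeper APIs. The ordering shown (roll the log, then snapshot) is one possible choice, not something the thread settles.

```java
/**
 * Hypothetical reaction to a missing txn log: warn, roll the log, snapshot.
 * None of these names are ZooKeeper APIs.
 */
final class TxnLogLossHandler {

    /** Stand-in for the server's persistence hooks (an assumption for this sketch). */
    interface Persistence {
        void rollLog();       // close the current txn log and start a new file
        void takeSnapshot();  // write the in-memory data tree to disk
    }

    private final Persistence persistence;

    TxnLogLossHandler(Persistence persistence) {
        this.persistence = persistence;
    }

    /** Called by whatever detection mechanism notices that the active log is gone. */
    void onTxnLogMissing(String fileName) {
        System.err.println("WARN: txn log " + fileName + " disappeared under a running server");
        // Roll first so that writes arriving during the snapshot go to a file that exists.
        persistence.rollLog();
        // Then persist the data tree so that txns already applied in memory survive a restart.
        persistence.takeSnapshot();
    }
}
```

This sketch does not address txns that were logged but not yet applied to the data tree when the snapshot is taken; a real change would need to handle that window to meet the "do not lose any txns" goal.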







[jira] [Commented] (ZOOKEEPER-3333) Detect if txnlogs and / or snapshots is deleted under a running ZK instance

2019-03-25 Thread Brian Nixon (JIRA)


[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16801160#comment-16801160
 ] 

Brian Nixon commented on ZOOKEEPER-:


I'm assuming that the scenario in mind is a rogue process on your host that is 
deleting files. One thing to note is that until ZOOKEEPER-3318 is completed to 
a reasonable state, people may outsource their backups to an external process - 
in which case the transaction log files and the snapshot files may have their 
lifecycle controlled by something that is not ZooKeeper (and ZooKeeper should 
not die when files disappear).

 

Having a message logged when a .snap or .log file is unexpectedly changed seems 
reasonable. We could also add a feature by which the deletion of transaction 
logs triggers a snapshot, to make sure the data tree would survive a sudden 
restart. I would not kill the server when a transaction log disappears, since 
that would remove your one known copy of the data tree (the one in memory).

 

To implement this, you may be able to reuse the FileChangeWatcher that was 
added for the TLS work or at least copy from its approach.
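
As a rough illustration of the watcher approach, here is a sketch that uses the JDK's `java.nio.file.WatchService` directly rather than ZooKeeper's FileChangeWatcher, whose API is not shown in this thread. `DataDirDeletionWatcher` is a hypothetical name, and the `log.` / `snapshot.` file name prefixes in `isDataFile` are an assumption about how the data files are named; adjust them to match the actual directory contents.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical watcher that reports deletions of txn log and snapshot files in one directory. */
final class DataDirDeletionWatcher implements AutoCloseable {
    private final WatchService watcher;

    DataDirDeletionWatcher(Path dir) {
        try {
            watcher = dir.getFileSystem().newWatchService();
            dir.register(watcher, StandardWatchEventKinds.ENTRY_DELETE);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Assumed naming scheme: files starting with "log." or "snapshot.". */
    static boolean isDataFile(String name) {
        return name.startsWith("log.") || name.startsWith("snapshot.");
    }

    /** Waits up to timeoutMillis for events and returns the names of deleted data files. */
    List<String> pollDeleted(long timeoutMillis) {
        List<String> deleted = new ArrayList<>();
        WatchKey key;
        try {
            key = watcher.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return deleted;
        }
        if (key == null) {
            return deleted; // nothing happened within the timeout
        }
        for (WatchEvent<?> event : key.pollEvents()) {
            if (event.kind() == StandardWatchEventKinds.ENTRY_DELETE) {
                String name = event.context().toString();
                if (isDataFile(name)) {
                    System.err.println("WARN: data file " + name + " was deleted");
                    deleted.add(name);
                }
            }
        }
        key.reset(); // re-arm the key so later events are delivered
        return deleted;
    }

    @Override
    public void close() {
        try {
            watcher.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller would typically run `pollDeleted` in a loop on a background thread and only log or trigger a snapshot, in line with the point above that the server should not die when files disappear. Deletion of the watched directory itself is not handled by this sketch.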

> Detect if txnlogs and / or snapshots is deleted under a running ZK instance
> ---
>
> Key: ZOOKEEPER-
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-
> Project: ZooKeeper
> Issue Type: Improvement
> Components: server
> Affects Versions: 3.6.0, 3.5.5, 3.4.14
> Reporter: Norbert Kalmar
> Priority: Major


