On 6/3/13 9:54 AM, "Flavio Junqueira" <[email protected]> wrote:
>On Jun 3, 2013, at 12:41 AM, Thawan Kooburat <[email protected]> wrote:
>
>> From my understanding, ZooKeeper currently maintains data integrity by
>> validating all the data before loading it into memory. Disk-related
>> errors on one of the machines won't affect the correctness of the
>> ensemble since we are serving client or peer requests from in-memory
>> data only.
>
>Let me try to be a bit more concrete. Say that we corrupt arbitrarily a
>txn T in a log file, and that T has been acknowledged by 3 servers (S1,
>S2, S3) in an ensemble of 5 servers (S1, S2, S3, S4, S5). Let's assume
>that S3 has corrupted T in its log. Next say that S5 becomes the leader
>supported by S3 and S4 (S3 has restarted). We can elect S5 because it has
>the same history as S3 and S3 has corrupted T (we ignore any transaction
>it may have after T), which S5 doesn't have. If this can happen, then we
>lost T even though T has been acknowledged by a quorum.
>
>In any case, I'm interested in defining precisely what integrity
>guarantees we provide for txn logs/snapshots. The point I was trying to
>convey is that we can't tolerate arbitrary corruptions of the txn log. We
>can only tolerate (and I'm not convinced there is a reason to push it
>further) corruption of a suffix of the txn log that has not been
>acknowledged and the txns in this suffix haven't been acknowledged
>because the server crashed before they have been completely flushed to
>disk.

I believe the problem you are describing here is essentially that we have
more failures than we can tolerate. Ideally, if S1 or S2 participated in
the next round of leader election, S1 or S2 should be elected as the
leader because they have the highest zxid. S3 has txnlog corruption at T,
so it should report its zxid as T-1 during leader election. Because of how
leader election works, corruption on fewer than a majority of the servers
should not affect correctness.
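
To make the two readings of this scenario concrete, here is a minimal sketch of the election outcome. It is not ZooKeeper code: the zxid value for T, the helper names, and the rule "highest zxid wins, ties broken by server id" are simplifying assumptions for illustration, and epochs are ignored entirely.

```python
# Illustrative model only; names and values below are hypothetical.
T = 100                      # assumed zxid of the transaction in question
ENSEMBLE_SIZE = 5
QUORUM = ENSEMBLE_SIZE // 2 + 1


def reported_zxid(log, first_corrupt=None):
    """Highest zxid a server can vouch for.

    Everything at or after the first corrupt txn is ignored, so a server
    whose log is corrupt at T reports T-1.
    """
    usable = [z for z in log if first_corrupt is None or z < first_corrupt]
    return max(usable, default=0)


def elect(participants):
    """participants maps server id -> reported zxid.

    Returns the winning server id, or None if there is no quorum.
    Assumption: highest zxid wins, ties go to the highest server id.
    """
    if len(participants) < QUORUM:
        return None
    return max(participants, key=lambda sid: (participants[sid], sid))


full_log = list(range(1, T + 1))   # S1, S2, S3 logged and acked T
short_log = list(range(1, T))      # S4, S5 never received T

zxids = {
    1: reported_zxid(full_log),
    2: reported_zxid(full_log),
    3: reported_zxid(full_log, first_corrupt=T),  # corrupt at T
    4: reported_zxid(short_log),
    5: reported_zxid(short_log),
}


def votes(sids):
    """Restrict the election to the servers that are currently up."""
    return {sid: zxids[sid] for sid in sids}


leader_with_s1 = elect(votes([1, 3, 4, 5]))   # S1 still has T and wins
leader_without = elect(votes([3, 4, 5]))      # S5 wins and T is lost
```

With S1 in the election, S1 wins and T survives. With only S3, S4 and S5 up, S5 wins with a history that ends at T-1, which is the loss Flavio describes; it requires S1 and S2 to be unavailable in addition to the corruption on S3, that is, three faulty servers out of five.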

However, in ZK-1413, ZK-22, and ZK-1416, a server uses its local txnlog to
respond to a request, so these features are vulnerable to disk corruption
or operator error on a single machine. Even so, correctness is not
affected as long as we detect the corruption correctly.

>
>>
>> However, in ZK-1413, the leader uses the on-disk txnlog to synchronize
>> with the learner. It seems like we have to keep checking txnlog
>> integrity every time we read something from disk. And I don't think the
>> integrity check is cheap either, since we have to scan the entire
>> history (starting from a given zxid).
>
>For the average case, this might not be too bad. If I remember correctly,
>it is possible to calibrate the amount of transactions a server is
>willing to read from disk when deciding whether to send a snapshot.
>
>>
>> If we cache txnlog in memory, we only need to do the integrity check
>> once and we can also build some indexes on top of it to support more
>> efficient lookup. However, this is going to consume a lot of memory.
>>
>
>Agreed, although I'd rather generate a few numbers before we claim it is
>bad and that we need a cache

For 1413, the current implementation works fine if the parameters are
configured appropriately. I mentioned caching because other features like
ZK-22 or ZK-1416 might need it. If we ever need to modify the txnlog
facility, we can think of ways to solve the problems for the other
features as well.

>
>> On the other hand, these features (ZK-1413, ZK-22, ZK-1416) don't really
>> need the entire txnlog to be valid. The server can always tell the
>> client that the history needed to answer the request is too old, and
>> there is a fallback mechanism that allows the system to make progress
>> correctly. For example, in ZK-1413, the leader can fall back to sending
>> a snapshot to the learner if it cannot use the txnlog for any reason.
>
>Sure, this covers some cases, but I don't see how it covers the case
>above. I think it doesn't, right?
>
>-Flavio
>
>>
>> --
>> Thawan Kooburat
>>
>>
>> On 6/1/13 8:18 AM, "Flavio Junqueira" <[email protected]> wrote:
>>
>>> I think this discussion has been triggered by a discussion we have had
>>> for ZOOKEEPER-1413. In the patch Thawan proposed there, there was a
>>> method that reads txn logs and it simply logs an error in the case of
>>> an exception while reading the log. I raised the question of whether we
>>> should do more than simply logging an error message and the discussion
>>> about txn log started, but it seems to be a discussion that is out of
>>> the scope of 1413, so we thought it would be good to have this
>>> discussion separately.
>>>
>>> Here are a few thoughts about the issue. We can't really tolerate
>>> arbitrary corruptions of the txn log because it could imply that we
>>> lose quorum for a txn that has been processed and a response has been
>>> returned to the client. In the case that a faulty server only partially
>>> writes a txn into a txn log because it crashes, the logged txn is
>>> corrupt, but we don't really have an issue because the server has not
>>> acked the txn, so if there is a quorum for that txn, the faulty server
>>> is not really part of it. Cases like this I believe we can do something
>>> about, but more generally taking care of txn log integrity sounds like
>>> a hard problem.
>>>
>>> -Flavio
>>>
>>>
>>> On Jun 1, 2013, at 4:29 PM, Camille Fournier <[email protected]>
>>> wrote:
>>>
>>>> I think it's an interesting idea certainly worth discussing. Do you
>>>> have any proposals for how we might modify? What should we think about
>>>> wrt migration/backwards compatibility?
>>>>
>>>> C
>>>>
>>>>
>>>> On Fri, May 31, 2013 at 8:26 PM, Thawan Kooburat <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I just want to start a discussion about the usage of txnlog. Here is
>>>>> the list of features that need to look up information from txnlog.
>>>>> These features need to ensure the integrity of txnlog, and having an
>>>>> efficient lookup is good for performance as well.
>>>>>
>>>>> ZOOKEEPER-1413 - The leader uses txnlog to synchronize with the
>>>>> learners. It needs to read txnlog in a sequential manner starting
>>>>> from a given zxid.
>>>>> ZOOKEEPER-22 - The design proposal mentioned that the leader should
>>>>> look up txnlog to respond to the client if a request is accepted by
>>>>> the client or not. The server needs to look up txns by sessionId and
>>>>> cxid.
>>>>> ZOOKEEPER-1416 - The server needs to be able to tell the list of
>>>>> deleted nodes starting from a given zxid. One possible implementation
>>>>> is to walk txnlog starting from a given zxid and look for delete
>>>>> txns.
>>>>>
>>>>> Do we need to change the way we store txnlog so that we can ensure
>>>>> integrity and more efficient lookup?
>>>>>
>>>>> --
>>>>> Thawan Kooburat
>>>>>
>>>
>>
>
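
The lookups listed in the original message, together with the "history is too old, fall back" idea raised earlier in the thread, can be sketched roughly as follows. This is an illustration under stated assumptions, not ZooKeeper's on-disk format or API: the record layout, the helper names, and the "DIFF"/"SNAP" labels are hypothetical, and Adler-32 is used here only as a convenient per-record checksum.

```python
import zlib


def make_record(zxid, op, path):
    """Build a hypothetical log record: (checksum, payload)."""
    payload = f"{zxid}|{op}|{path}".encode()
    return (zlib.adler32(payload), payload)


def read_from(log, start_zxid):
    """Return txns with zxid >= start_zxid, or None if they cannot be trusted.

    Corruption in the older prefix is tolerated: the log is ordered, so a
    valid record below start_zxid proves any earlier bad record is not needed.
    """
    txns = []
    unresolved_corruption = False
    for checksum, payload in log:
        if zlib.adler32(payload) != checksum:
            unresolved_corruption = True   # the zxid inside is untrustworthy
            continue
        zxid_text, op, path = payload.decode().split("|", 2)
        zxid = int(zxid_text)
        if zxid < start_zxid:
            unresolved_corruption = False  # bad record was in the old prefix
            continue
        if unresolved_corruption:
            return None                    # bad record may be in our range
        txns.append((zxid, op, path))
    return None if unresolved_corruption else txns


def deleted_since(log, start_zxid):
    """ZOOKEEPER-1416 style lookup: paths deleted at or after start_zxid."""
    txns = read_from(log, start_zxid)
    if txns is None:
        return None                        # caller reports "history too old"
    return [path for _, op, path in txns if op == "delete"]


def sync_plan(log, learner_zxid):
    """ZOOKEEPER-1413 style decision: replay txns or fall back to a snapshot."""
    txns = read_from(log, learner_zxid + 1)
    if txns is None:
        return ("SNAP", [])
    return ("DIFF", txns)


log = [
    make_record(1, "create", "/a"),
    make_record(2, "create", "/b"),
    make_record(3, "delete", "/b"),
    make_record(4, "delete", "/a"),
]


def corrupt(records, index):
    """Return a copy of records with one payload byte flipped at index."""
    checksum, payload = records[index]
    damaged = (checksum, payload[:-1] + b"X")
    return records[:index] + [damaged] + records[index + 1:]
```

For example, `sync_plan(corrupt(log, 2), 1)` yields a snapshot fallback, while `deleted_since(corrupt(log, 0), 3)` still answers from the log because the damaged record lies before the requested zxid. Note that this only covers detection on the server answering the request; it does not address the quorum scenario discussed above.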
