The two key points I can extract from this discussion (please feel free to add to the list) are:
- We can't tolerate arbitrary corruption of log entries. We can tolerate corruption of a log suffix due to a crash, in which case the txns in that suffix have not been acknowledged.
- Verifying digests is possibly expensive, so we might need to look at ways to avoid the performance penalty, like caching txns in memory (a rough sketch of a per-record check appears further below).

One more comment below:

On Jun 4, 2013, at 9:22 PM, Thawan Kooburat <[email protected]> wrote:

> On 6/3/13 9:54 AM, "Flavio Junqueira" <[email protected]> wrote:
>
>> On Jun 3, 2013, at 12:41 AM, Thawan Kooburat <[email protected]> wrote:
>>
>>> From my understanding, ZooKeeper currently maintains data integrity by validating all the data before loading it into memory. Disk-related errors on one of the machines won't affect the correctness of the ensemble, since we serve client and peer requests from in-memory data only.
>>
>> Let me try to be a bit more concrete. Say that we arbitrarily corrupt a txn T in a log file, and that T has been acknowledged by 3 servers (S1, S2, S3) in an ensemble of 5 servers (S1, S2, S3, S4, S5). Let's assume that S3 has corrupted T in its log. Next, say that S5 becomes the leader supported by S3 and S4 (S3 has restarted). We can elect S5 because its history matches S3's once T is corrupted (we ignore any transaction S3 may have after T), even though S5 doesn't have T. If this can happen, then we lost T even though T had been acknowledged by a quorum.
>>
>> In any case, I'm interested in defining precisely what integrity guarantees we provide for txn logs/snapshots. The point I was trying to convey is that we can't tolerate arbitrary corruptions of the txn log. We can only tolerate (and I'm not convinced there is a reason to push it further) corruption of a suffix of the txn log whose txns have not been acknowledged, because the server crashed before they were completely flushed to disk.
>
> I believe the problem you are describing here is essentially that we have more failures than we can tolerate. Ideally, if S1 or S2 participated in the next round of leader election, S1 or S2 should be elected leader because they have the highest zxid. S3 has txnlog corruption at T, so it should report its zxid as T-1 during leader election.
>
> Because of how leader election works, a corruption in less than a majority should not affect correctness. However, in ZK-1413, ZK-22, and ZK-1416, a server uses its local txnlog to respond to a request, so these features are vulnerable to a single machine's disk corruption or operator error. It won't affect correctness, though, if we can detect the corruption correctly.

In the case I described above, there was a corruption on only one server and yet it caused a problematic scenario. I don't think we can claim that we can tolerate corruption of a minority; one single corruption might already be problematic.

>>> However, in ZK-1413 the leader uses the on-disk txnlog to synchronize with the learner. It seems like we have to keep checking txnlog integrity every time we read something from disk, and I don't think the integrity check is cheap either, since we have to scan the entire history (starting from a given zxid).
>>
>> For the average case, this might not be too bad. If I remember correctly, it is possible to calibrate the amount of transactions a server is willing to read from disk when deciding whether to send a snapshot.
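To make this verification cost more concrete, here is a rough sketch of what a per-record check on a sequential read could look like. The record layout ([length][checksum][payload]) and the use of Adler32 are assumptions made for illustration only, not the actual on-disk format:

    import java.io.DataInputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.zip.Adler32;

    // Illustrative scanner for a hypothetical log layout: [int length][long checksum][payload].
    // It stops at the first truncated or corrupt record, treating everything after that
    // point as an unusable suffix.
    public class TxnLogScanner {
        public static long countValidRecords(String path) throws IOException {
            long validRecords = 0;
            try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
                while (true) {
                    try {
                        int length = in.readInt();
                        long expected = in.readLong();
                        if (length <= 0) {
                            break; // garbage length: treat as corruption
                        }
                        byte[] payload = new byte[length];
                        in.readFully(payload);

                        Adler32 crc = new Adler32();
                        crc.update(payload, 0, payload.length);
                        if (crc.getValue() != expected) {
                            break; // checksum mismatch: corrupt record, stop scanning
                        }
                        validRecords++;
                        // deserialize payload into a txn and hand it to the caller here
                    } catch (EOFException e) {
                        break; // truncated suffix from a crash: treat as end of log
                    }
                }
            }
            return validRecords;
        }
    }

The cost of such a scan is linear in the amount of log read, which is why bounding how much a server is willing to read from disk (or caching already-verified txns) matters.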
>>> If we cache the txnlog in memory, we only need to do the integrity check once, and we can also build some indexes on top of it to support more efficient lookup. However, this is going to consume a lot of memory.
>>
>> Agreed, although I'd rather generate a few numbers before we claim it is bad and that we need a cache.
>
> For 1413, the current implementation works fine if the parameters are configured appropriately. I mentioned caching because other features like ZK-22 or ZK-1416 might need it. If we ever need to modify the txnlog facility, we can think of ways to solve problems for the other features as well.
>
>>> On the other hand, these features (ZK-1413, ZK-22, ZK-1416) don't really need the entire txnlog to be valid. The server can always tell the client that the history needed to answer the request is too old, and there is a fallback mechanism that allows the system to make progress correctly. For example, in ZK-1413 the leader can fall back to sending a snapshot to the learner if it cannot use the txnlog for any reason.
>>
>> Sure, this covers some cases, but I don't see how it covers the case above. I think it doesn't, right?
>>
>> -Flavio
>>
>>> --
>>> Thawan Kooburat
>>>
>>> On 6/1/13 8:18 AM, "Flavio Junqueira" <[email protected]> wrote:
>>>
>>>> I think this discussion was triggered by a discussion we have had on ZOOKEEPER-1413. In the patch Thawan proposed there, there was a method that reads txn logs, and it simply logs an error in the case of an exception while reading the log. I raised the question of whether we should do more than simply log an error message, and the discussion about the txn log started, but it seems to be out of the scope of 1413, so we thought it would be good to have this discussion separately.
>>>>
>>>> Here are a few thoughts about the issue. We can't really tolerate arbitrary corruptions of the txn log, because it could imply that we lose the quorum for a txn that has been processed and for which a response has been returned to the client. In the case that a faulty server only partially writes a txn into a txn log because it crashes, the logged txn is corrupt, but we don't really have an issue, because the server has not acked the txn, so if there is a quorum for that txn, the faulty server is not really part of it. Cases like this I believe we can do something about, but more generally taking care of txn log integrity sounds like a hard problem.
>>>>
>>>> -Flavio
>>>>
>>>> On Jun 1, 2013, at 4:29 PM, Camille Fournier <[email protected]> wrote:
>>>>
>>>>> I think it's an interesting idea certainly worth discussing. Do you have any proposals for how we might modify it? What should we think about wrt migration/backwards compatibility?
>>>>>
>>>>> C
>>>>>
>>>>> On Fri, May 31, 2013 at 8:26 PM, Thawan Kooburat <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I just want to start a discussion about the usage of the txnlog. Here is the list of features that need to look up information from the txnlog. These features need to ensure the integrity of the txnlog, and an efficient lookup is good for performance as well.
>>>>>>
>>>>>> ZOOKEEPER-1413 - The leader uses the txnlog to synchronize with the learners. It needs to read the txnlog sequentially starting from a given zxid.
>>>>>> ZOOKEEPER-22 - The design proposal mentioned that the leader should look up the txnlog to tell the client whether a request from the client was accepted or not. The server needs to look up a txn by sessionId and cxid.
>>>>>> ZOOKEEPER-1416 - The server needs to be able to report the list of deleted nodes starting from a given zxid. One possible implementation is to walk the txnlog starting from that zxid and look for delete txns.
>>>>>>
>>>>>> Do we need to change the way we store the txnlog so that we can ensure integrity and support more efficient lookup?
>>>>>>
>>>>>> --
>>>>>> Thawan Kooburat
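Putting the three lookup patterns above together: if we did go the caching/indexing route, a minimal sketch of an in-memory index serving all three might look like the following. All class and field names here (CachedTxn, isDelete, and so on) are made up for illustration and do not correspond to existing ZooKeeper classes:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    // Illustrative in-memory index over already-verified txns.
    public class TxnIndex {
        public static class CachedTxn {
            public final long zxid;
            public final long sessionId;
            public final int cxid;
            public final boolean isDelete; // true for delete txns (ZK-1416 style lookup)
            public final String path;

            public CachedTxn(long zxid, long sessionId, int cxid, boolean isDelete, String path) {
                this.zxid = zxid;
                this.sessionId = sessionId;
                this.cxid = cxid;
                this.isDelete = isDelete;
                this.path = path;
            }
        }

        // Primary index: ordered by zxid, so a range scan from a given zxid is cheap.
        private final TreeMap<Long, CachedTxn> byZxid = new TreeMap<>();
        // Secondary index: (sessionId, cxid) -> zxid, for the ZK-22 style lookup.
        private final Map<String, Long> bySessionCxid = new HashMap<>();

        public void add(CachedTxn txn) {
            byZxid.put(txn.zxid, txn);
            bySessionCxid.put(txn.sessionId + ":" + txn.cxid, txn.zxid);
        }

        // ZK-1413: iterate txns in order, starting from a given zxid.
        public Iterable<CachedTxn> from(long zxid) {
            return byZxid.tailMap(zxid, true).values();
        }

        // ZK-22: did we log a txn for this (sessionId, cxid)?
        public CachedTxn bySessionAndCxid(long sessionId, int cxid) {
            Long zxid = bySessionCxid.get(sessionId + ":" + cxid);
            return zxid == null ? null : byZxid.get(zxid);
        }

        // ZK-1416: paths deleted at or after a given zxid.
        public List<String> deletedSince(long zxid) {
            List<String> deleted = new ArrayList<>();
            for (CachedTxn txn : byZxid.tailMap(zxid, true).values()) {
                if (txn.isDelete) {
                    deleted.add(txn.path);
                }
            }
            return deleted;
        }
    }

A TreeMap keyed by zxid gives cheap range scans for the 1413- and 1416-style lookups, while the secondary map gives constant-time lookup for the 22-style (sessionId, cxid) query; the open question from the thread is whether the memory cost of holding such a cache is worth it.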
