The two key points I can extract from this discussion (please feel free to
add to the list) are:

- We can't tolerate arbitrary corruption of log entries. We can tolerate 
corruption of a log suffix due to a crash, in which case the txns in that 
suffix have not been acknowledged.
- Verifying digests can be expensive, so we might need to look at ways to 
avoid the performance penalty, like caching txns in memory.
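
To make the first point concrete, here is a minimal sketch of the kind of 
per-record verification I have in mind. The framing (a stored checksum, a 
length, then the serialized txn) and the class name are made up for 
illustration and are not the actual FileTxnLog layout. A record that runs 
off the end of the file is the benign "truncated suffix after a crash" 
case; a digest mismatch earlier in the file is the arbitrary corruption we 
can't tolerate:

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.Adler32;
import java.util.zip.Checksum;

public class TxnLogScan {
    // Hypothetical record framing: [checksum:8][length:4][body:length]
    public static void scan(String path) throws IOException {
        try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
            long offset = 0;
            while (true) {
                long storedDigest;
                byte[] body;
                try {
                    storedDigest = in.readLong();
                    int len = in.readInt();
                    body = new byte[len];
                    in.readFully(body);
                } catch (EOFException e) {
                    // Partially written last record: the crash case we can
                    // tolerate, because the txn was never flushed or acked.
                    System.out.println("truncated suffix after offset " + offset);
                    return;
                }
                Checksum digest = new Adler32();
                digest.update(body, 0, body.length);
                if (digest.getValue() != storedDigest) {
                    // Corruption in the body of the log: the txn may have
                    // been acked, so we cannot silently drop it.
                    throw new IOException("corrupt txn at offset " + offset);
                }
                offset += 8 + 4 + body.length;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        scan(args[0]);
    }
}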

A few more comments inline below:
 
On Jun 4, 2013, at 9:22 PM, Thawan Kooburat <[email protected]> wrote:

> 
> On 6/3/13 9:54 AM, "Flavio Junqueira" <[email protected]> wrote:
> 
>> On Jun 3, 2013, at 12:41 AM, Thawan Kooburat <[email protected]> wrote:
>> 
>>> From my understanding, ZooKeeper currently maintains data integrity by
>>> validating all the data before loading it into memory. Disk-related
>>> errors on one of the machines won't affect the correctness of the
>>> ensemble, since we are serving client or peer requests from in-memory
>>> data only.
>> 
>> Let me try to be a bit more concrete. Say that a txn T in a log file is
>> arbitrarily corrupted, and that T has been acknowledged by 3 servers (S1,
>> S2, S3) in an ensemble of 5 servers (S1, S2, S3, S4, S5). Let's assume
>> that S3 has a corrupted copy of T in its log. Next say that S5 becomes
>> the leader supported by S3 and S4 (S3 has restarted). S5 can be elected
>> because, once S3 discards the corrupted T (and any transaction after it),
>> S3's history matches S5's, and S5 never had T. If this can happen, then
>> we have lost T even though T was acknowledged by a quorum.
>> 
>> In any case, I'm interested in defining precisely what integrity
>> guarantees we provide for txn logs/snapshots. The point I was trying to
>> convey is that we can't tolerate arbitrary corruption of the txn log. We
>> can only tolerate (and I'm not convinced there is a reason to push it
>> further) corruption of a suffix of the txn log whose txns have not been
>> acknowledged, because the server crashed before they were completely
>> flushed to disk.
> 
> I believe the problem you are describing here is essentially that we have
> more failures than we can tolerate. Ideally, if S1 or S2 participated in
> the next round of leader election, S1 or S2 should be elected as the
> leader because they have the highest zxid. S3 has txnlog corruption at T,
> so it should report its zxid as T-1 during leader election.
> 
> 
> Because of how leader election works, corruption in less than a majority
> should not affect correctness. However, in ZK-1413, ZK-22, and ZK-1416, a
> server uses its local txnlog to respond to a request, so these features
> are vulnerable to a single machine's disk corruption or operator error.
> However, it won't affect correctness if we can detect the corruption
> reliably.

In the case I described above, there was corruption in only one server and 
yet it caused a problematic scenario. I don't think we can claim that we can 
tolerate corruption of a minority; a single corruption might already be 
problematic.
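
To be very explicit about why a single corruption is enough, here is a toy 
election sketch (not the real FastLeaderElection code; the zxid values and 
the "highest lastZxid among the voters wins" rule are simplifications). If 
S3 truncates to T-1 after detecting the corruption and only S3, S4, and S5 
vote, the new leader's history no longer contains T, even though T was 
acked by the quorum {S1, S2, S3}:

import java.util.List;
import java.util.Map;

public class LostTxnScenario {
    // lastZxid reported during election; txn T = 100 was acked by S1, S2, S3.
    static final Map<String, Long> LAST_ZXID = Map.of(
            "S1", 100L,
            "S2", 100L,
            "S3", 99L,   // S3 found T corrupted on disk and truncated to T-1
            "S4", 99L,
            "S5", 99L);

    // Toy rule: among the servers that vote, the highest lastZxid wins.
    static String elect(List<String> voters) {
        return voters.stream()
                .max((a, b) -> Long.compare(LAST_ZXID.get(a), LAST_ZXID.get(b)))
                .orElseThrow();
    }

    public static void main(String[] args) {
        // S1 and S2 are not in this election round; S3, S4, S5 form a quorum.
        String leader = elect(List.of("S5", "S4", "S3"));
        System.out.println(leader + " elected with lastZxid "
                + LAST_ZXID.get(leader));
        // The elected leader's history ends at 99: txn 100 is lost even
        // though a quorum (S1, S2, S3) acked it before S3's disk corrupted it.
    }
}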

> 
>> 
>>> 
>>> However, in ZK-1413 the leader uses the on-disk txnlog to synchronize
>>> with the learner. It seems like we have to keep checking txnlog
>>> integrity every time we read something from disk. And I don't think the
>>> integrity check is cheap either, since we have to scan the entire
>>> history (starting from a given zxid).
>> 
>> For the average case, this might not be too bad. If I remember correctly,
>> it is possible to calibrate the number of transactions a server is
>> willing to read from disk when deciding whether to send a snapshot.
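
As a rough illustration of that calibration (purely a sketch: the method 
name and the maxTxnsToReplay knob are hypothetical, and I'm treating the 
zxid gap as a txn count only to keep it short), the leader serves a diff 
from the log only if the follower is within the configured distance and 
otherwise falls back to a snapshot, so the integrity check is bounded by 
that knob:

public class SyncDecision {
    // Toy version of the leader's choice when syncing a learner.
    // maxTxnsToReplay is a hypothetical tuning knob standing in for "how
    // many txns the leader is willing to read (and verify) from disk".
    static String chooseSync(long leaderLastZxid, long peerLastZxid,
                             long maxTxnsToReplay) {
        long behind = leaderLastZxid - peerLastZxid;
        if (behind < 0) {
            return "TRUNC";   // learner is ahead of the leader
        } else if (behind <= maxTxnsToReplay) {
            return "DIFF";    // replay the missing txns from the txnlog
        } else {
            return "SNAP";    // too far behind: cheaper to ship a snapshot
        }
    }

    public static void main(String[] args) {
        System.out.println(chooseSync(1_000_000L, 999_500L, 10_000L)); // DIFF
        System.out.println(chooseSync(1_000_000L, 100_000L, 10_000L)); // SNAP
    }
}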
>> 
>>> 
>>> If we cache the txnlog in memory, we only need to do the integrity
>>> check once, and we can also build some indexes on top of it to support
>>> more efficient lookups. However, this is going to consume a lot of
>>> memory.
>>> 
>> 
>> Agreed, although I'd rather generate a few numbers before we claim it is
>> bad and that we need a cache.
> 
> For 1413, the current implementation works fine if the parameters are
> configured appropriately. I mentioned caching because other features like
> ZK-22 or ZK-1416 might need it. If we ever need to modify the txnlog
> facility, we can think of ways to solve problems for the other features as
> well.
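
If we do go down the caching road, I imagine something as simple as a 
zxid-ordered map would cover the ZK-1416 style "walk forward from a zxid" 
lookup. The sketch below is only meant to show the shape of such an index 
and the memory concern, not a proposed class (the name and methods are made 
up):

import java.util.SortedMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class TxnCacheSketch {
    // Serialized txns that already passed their digest check, keyed by zxid.
    private final ConcurrentSkipListMap<Long, byte[]> byZxid =
            new ConcurrentSkipListMap<>();

    void put(long zxid, byte[] serializedTxn) {
        byZxid.put(zxid, serializedTxn);
    }

    // ZK-1416 style lookup: all cached txns at or after the given zxid, in order.
    SortedMap<Long, byte[]> tailFrom(long zxid) {
        return byZxid.tailMap(zxid, true);
    }

    // The memory concern raised above: this grows with the retained history.
    long approximateBytes() {
        return byZxid.values().stream().mapToLong(b -> (long) b.length).sum();
    }
}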
> 
>> 
>>> On the other hand, these features (ZK-1413, ZK-22, ZK-1416) don't
>>> really need the entire txnlog to be valid. The server can always tell
>>> the client that the history needed to answer the request is too old,
>>> and there is a fallback mechanism that allows the system to make
>>> progress correctly. For example, in ZK-1413 the leader can fall back to
>>> sending a snapshot to the learner if it cannot use the txnlog for any
>>> reason.
>> 
>> Sure, this covers some cases, but I don't see how it covers the case
>> above. I think it doesn't, right?
> 
>> 
>> -Flavio
>> 
>>> 
>>> 
>>> -- 
>>> Thawan Kooburat
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On 6/1/13 8:18 AM, "Flavio Junqueira" <[email protected]> wrote:
>>> 
>>>> I think this discussion has been triggered by a discussion we have had
>>>> for ZOOKEEPER-1413. In the patch Thawan proposed there, there was a
>>>> method that reads txn logs, and it simply logs an error in the case of
>>>> an exception while reading the log. I raised the question of whether
>>>> we should do more than simply logging an error message, and the
>>>> discussion about the txn log started, but it seems to be a discussion
>>>> that is out of the scope of 1413, so we thought it would be good to
>>>> have this discussion separately.
>>>> 
>>>> Here are a few thoughts about the issue. We can't really tolerate
>>>> arbitrary corruptions of the txn log because it could imply that we
>>>> lose
>>>> quorum for a txn that has been processed and a response has been
>>>> returned
>>>> to the client. In the case that a faulty server only partially writes a
>>>> txn into a txn log because it crashes, the logged txn is corrupt, but
>>>> we
>>>> don't really have an issue because the server has not acked the txn, so
>>>> if there is a quorum for that txn, the faulty server is not really part
>>>> of it. Cases like this I believe we can do something about, but more
>>>> generally taking care of txn log integrity sounds like a hard problem.
>>>> 
>>>> -Flavio
>>>> 
>>>> 
>>>> On Jun 1, 2013, at 4:29 PM, Camille Fournier <[email protected]>
>>>> wrote:
>>>> 
>>>>> I think it's an interesting idea, certainly worth discussing. Do you
>>>>> have any proposals for how we might modify it? What should we think
>>>>> about wrt migration/backwards compatibility?
>>>>> 
>>>>> C
>>>>> 
>>>>> 
>>>>> On Fri, May 31, 2013 at 8:26 PM, Thawan Kooburat <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> I just want to start a discussion about the usage of the txnlog.
>>>>>> Here is the list of features that need to look up information from
>>>>>> the txnlog. These features need to ensure the integrity of the
>>>>>> txnlog, and having an efficient lookup is good for performance as
>>>>>> well.
>>>>>> 
>>>>>> ZOOKEEPER-1413 - The leader uses the txnlog to synchronize with the
>>>>>> learners. It needs to read the txnlog in a sequential manner
>>>>>> starting from a given zxid.
>>>>>> ZOOKEEPER-22 - The design proposal mentioned that the leader should
>>>>>> look up the txnlog to respond to the client whether a request was
>>>>>> accepted or not. The server needs to look up a txn by sessionId and
>>>>>> cxid.
>>>>>> ZOOKEEPER-1416 - The server needs to be able to report the list of
>>>>>> deleted nodes starting from a given zxid. One possible
>>>>>> implementation is to walk the txnlog starting from a given zxid and
>>>>>> look for delete txns.
>>>>>> 
>>>>>> Do we need to change the way we store the txnlog so that we can
>>>>>> ensure integrity and support more efficient lookups?
>>>>>> 
>>>>>> --
>>>>>> Thawan Kooburat
>>>>>> 
>>>> 
>>> 
>> 
> 
