[ 
https://issues.apache.org/jira/browse/HADOOP-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069344#comment-14069344
 ] 

Steve Loughran commented on HADOOP-10641:
-----------------------------------------

bq. This is a good idea in the abstract, but the notion of applying Amazon's 
process to a volunteer open source project is problematic.



Consensus protocols are expected to come with proofs of the algorithm's 
correctness; anything derived from Paxos, Raft et al. relies on those 
algorithms being considered valid, and on the implementors being able to 
understand them. Open source consensus protocol *implementations* are expected 
to publish their inner workings, else they can't be trusted. I will cite Apache 
Zookeeper's [ZAB protocol|http://web.stanford.edu/class/cs347/reading/zab.pdf] 
and [Anubis's consistent T-space 
model|http://www.hpl.hp.com/techreports/2005/HPL-2005-72.html] as examples of 
two OSS products that I have used and whose implementations I trust. 


bq.  In terms of the Hadoop contribution process, this is a novel requirement. 

Implementations of distributed consensus protocols are already one place where 
the team needs people who understand the maths. If a team implementing a 
protocol isn't able to specify it formally in some form or other: run. And if 
someone who can't prove their change works tries to submit changes to the core 
protocols of an OSS implementation, I would hope that the patch will be 
rejected. 


Which is why I believe this specific JIRA, "provide an API and reference 
implementation of distributed updates", meets the criterion "provide a strict 
specification". I'm confident that someone on the WanDisco dev team will be 
able to do this, and I would make "understand this specification" a 
prerequisite for anyone else doing their own implementation. 

Even so, we can't expect complete proofs of correctness. Which is why I said 
"any maths that can be provided, and test cases".

For HADOOP-9361, the test cases were the main outcome: by enumerating 
invariants and pre/post conditions, some places where we didn't have enough 
tests became apparent. These were mostly failure modes of some operations (e.g. 
what happens when preconditions aren't met).
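
To make that concrete, here is a minimal sketch of the kind of failure-mode 
test that falls out of one such precondition ("the source of a rename must 
exist"). The class and test names are purely illustrative, not the actual 
HADOOP-9361 contract suite; the point is only that once the spec enumerates 
the failure mode, a test can pin it down.

{code:java}
// Illustrative sketch only -- not the real HADOOP-9361 contract tests.
// Derived from the precondition "the source of a rename must exist";
// the test checks the failure mode when that precondition is violated.
import static org.junit.Assert.fail;

import java.io.FileNotFoundException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

public class TestRenameMissingSource {

  @Test
  public void testRenameNonexistentSource() throws Exception {
    FileSystem fs = FileSystem.getLocal(new Configuration());
    Path missing = new Path("/tmp/no-such-file-" + System.nanoTime());
    Path dest = new Path("/tmp/rename-dest-" + System.nanoTime());
    try {
      // Depending on the filesystem, the spec allows either a 'false' return
      // or a FileNotFoundException here; silently reporting success is a bug.
      if (fs.rename(missing, dest)) {
        fail("rename() of a nonexistent source reported success");
      }
    } catch (FileNotFoundException expected) {
      // also acceptable under the specification
    }
  }
}
{code}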

Derived tests are great because:
# Jenkins can run them; you can't get mathematicians to prove things during 
automated regression tests.
# It becomes easier to decide whether a test failure is due to an error in the 
test or a failure of the code. If a specification-derived test fails, the fault 
now lies in either the specification or the code.

I think we need to do the same here: from a specification of the API, build 
test cases which verify the behavior as well as local tests can. Implementors 
of the back end then get those tests alongside a specification which defines 
exactly what they have to implement. 
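
As a sketch of the shape that could take: an abstract contract test encodes the 
specified behaviour once, and every back end inherits it. The 
CoordinationEngine interface and its methods below are hypothetical 
placeholders, not the API in the attached patches.

{code:java}
// Hypothetical sketch only: the interface and its semantics are placeholders,
// not the API proposed in this JIRA.
import static org.junit.Assert.assertEquals;

import java.util.Arrays;
import java.util.List;
import org.junit.Test;

public abstract class AbstractCoordinationContractTest {

  /** A deliberately minimal stand-in for whatever API the spec defines. */
  public interface CoordinationEngine {
    void submit(String proposal) throws Exception;
    List<String> agreedSequence() throws Exception;
  }

  /** Each back end (e.g. a ZooKeeper-based engine) supplies its own engine. */
  protected abstract CoordinationEngine createEngine() throws Exception;

  @Test
  public void testProposalsAgreedInSubmissionOrder() throws Exception {
    CoordinationEngine engine = createEngine();
    engine.submit("a");
    engine.submit("b");
    // Example specification clause under test: proposals submitted by a
    // single client appear in the agreed sequence in submission order.
    assertEquals(Arrays.asList("a", "b"), engine.agreedSequence());
  }
}
{code}

A ZooKeeper-backed implementation would then subclass this, implement 
createEngine(), and have Jenkins run the same assertions against it as against 
every other back end.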

The next issue becomes "can the people implementing things understand the 
specification?". That's why I used a notation built from Python expressions and 
data structures, one that should be easy to understand. It's also why users of 
TLA+ in the Java & C/C++ world tend to use the curly-braced form of the 
language. 

I'm sorry if this appears harsh, or if it seems I've suddenly added a new 
criterion to what Hadoop patches have to do, but given that this Coordination 
Engine is proposed as a central part of a future HDFS and YARN RM, then yes, we 
do have to define it properly. 


> Introduce Coordination Engine
> -----------------------------
>
>                 Key: HADOOP-10641
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10641
>             Project: Hadoop Common
>          Issue Type: New Feature
>    Affects Versions: 3.0.0
>            Reporter: Konstantin Shvachko
>            Assignee: Plamen Jeliazkov
>         Attachments: HADOOP-10641.patch, HADOOP-10641.patch, 
> HADOOP-10641.patch, hadoop-coordination.patch
>
>
> Coordination Engine (CE) is a system, which allows to agree on a sequence of 
> events in a distributed system. In order to be reliable CE should be 
> distributed by itself.
> Coordination Engine can be based on different algorithms (paxos, raft, 2PC, 
> zab) and have different implementations, depending on use cases, reliability, 
> availability, and performance requirements.
> CE should have a common API, so that it could serve as a pluggable component 
> in different projects. The immediate beneficiaries are HDFS (HDFS-6469) and 
> HBase (HBASE-10909).
> First implementation is proposed to be based on ZooKeeper.



