[
https://issues.apache.org/jira/browse/HADOOP-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14069344#comment-14069344
]
Steve Loughran commented on HADOOP-10641:
-----------------------------------------
bq. This is a good idea in the abstract, but the notion of applying Amazon's
process to a volunteer open source project is problematic.
Consensus protocols are expected to come with proofs of the algorithm's
correctness; anything derived from Paxos, Raft et al relies on those algorithms
being considered valid, and on the implementors being able to understand the
algorithms. Open source consensus protocol *implementations* are expected to
publish their inner workings, else they can't be trusted. I will cite Apache
ZooKeeper's [ZAB protocol|http://web.stanford.edu/class/cs347/reading/zab.pdf]
and [Anubis's consistent T-space
model|http://www.hpl.hp.com/techreports/2005/HPL-2005-72.html] as examples of
two OSS products that I have used and whose implementations I trust.
bq. In terms of the Hadoop contribution process, this is a novel requirement.
Implementations of distributed consensus protocols are already one place where
the team needs people who understand the maths. If a team implementing a
protocol isn't able to specify it formally in some form or other: run. And if
someone who can't prove that their changes work tries to submit changes to the
core protocols of an OSS implementation, I would hope that the patch would be
rejected.
Which is why I believe this specific JIRA, "provide an API and reference
implementation of distributed updates", is suitable for the criterion "provide
a strict specification". I'm confident that someone in the WanDisco dev team
will be able to do this, and I would make "understand this specification" a
prerequisite for anyone else doing their own implementation.
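To ground that, here is a rough sketch (purely illustrative, not taken from the
attached patches; every name is hypothetical) of the kind of API surface such a
specification would have to pin down:
{code:java}
import java.io.IOException;

/**
 * Illustrative sketch only -- not the API from the attached patches.
 * A pluggable coordination engine: clients submit proposals, and the
 * engine delivers the agreed sequence identically on every node.
 */
public interface CoordinationEngine {

  /** Submit a proposal; it may or may not be agreed upon. */
  void submitProposal(byte[] proposal) throws IOException;

  /**
   * Register the callback through which agreements are delivered. A strict
   * specification must state the guarantee precisely: every learner sees
   * the same agreements, in the same total order.
   */
  void registerLearner(AgreementListener listener);

  /** Callback for agreed proposals. */
  interface AgreementListener {
    /** Invoked once per agreement, in the agreed global sequence. */
    void onAgreement(long sequenceNumber, byte[] agreedValue);
  }
}
{code}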
Even so, we can't expect complete proofs of correctness. Which is why I said
"any maths that can be provided, and test cases".
For HADOOP-9361, the test cases were the main outcome: by enumerating
invariants and pre/post conditions, some places where we didn't have enough
tests became apparent. These were mostly failure modes of some operations (e.g.
what happens when preconditions aren't met).
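As an example of the kind of test that falls out of that process, here is a
minimal sketch against the local filesystem, derived from the spec precondition
that open() on a nonexistent path must raise FileNotFoundException:
{code:java}
import static org.junit.Assert.fail;

import java.io.FileNotFoundException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.junit.Test;

/**
 * Minimal sketch of a specification-derived failure-mode test:
 * the precondition for open() is exists(FS, p), else raise
 * FileNotFoundException.
 */
public class TestOpenPrecondition {

  @Test
  public void testOpenNonexistentPathRaisesFNFE() throws Exception {
    FileSystem fs = FileSystem.getLocal(new Configuration());
    Path missing = new Path("/tmp/does-not-exist-" + System.nanoTime());
    try {
      fs.open(missing);
      fail("Expected FileNotFoundException opening " + missing);
    } catch (FileNotFoundException expected) {
      // The precondition violation surfaced exactly as specified.
    }
  }
}
{code}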
Derived tests are great because:
# Jenkins can run them; you can't get mathematicians to prove things during
automated regression tests.
# They make it easier to decide whether a test failure is due to an error in
the test or a failure of the code: if a specification-derived test fails, it is
due to either an error in the specification or an error in the code.
I think we need to do the same here: from a specification of the API, build the
test cases which can verify the behaviour as well as local tests can.
Implementors of the back ends then get those tests alongside a specification
which defines what they have to implement.
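One way to package that hand-off, reusing the hypothetical CoordinationEngine
interface sketched above (again, illustrative names, not the patch's API): each
back end subclasses a base class and inherits every specification-derived test.
{code:java}
import static org.junit.Assert.assertArrayEquals;

import org.junit.Test;

/**
 * Hypothetical contract-test base class, reusing the CoordinationEngine
 * sketch above. Each back end implements createEngine() and inherits
 * the tests derived from the specification.
 */
public abstract class AbstractCoordinationEngineContractTest {

  /** Back ends supply the implementation under test. */
  protected abstract CoordinationEngine createEngine() throws Exception;

  @Test
  public void testAgreedValueMatchesProposal() throws Exception {
    CoordinationEngine engine = createEngine();
    final byte[][] delivered = new byte[1][];
    engine.registerLearner(new CoordinationEngine.AgreementListener() {
      @Override
      public void onAgreement(long sequenceNumber, byte[] agreedValue) {
        delivered[0] = agreedValue;
      }
    });
    byte[] proposal = "proposal-1".getBytes("UTF-8");
    engine.submitProposal(proposal);
    // A real suite would block until the agreement is delivered, and would
    // also check ordering across multiple proposals; elided for brevity.
    assertArrayEquals(proposal, delivered[0]);
  }
}
{code}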
The next issue becomes "can the people implementing things understand the
specification?". It's why I used a notation based on Python expressions and
data structures, one that should be easy to understand. It's also why users of
TLA+ in the Java & C/C++ world tend to use the curly-braced form of the
language.
I'm sorry if this appears harsh or if I've suddenly added a new criterion to
what Hadoop patches have to do, but given that this Coordination Engine is
proposed as a central part of a future HDFS and YARN RM, then yes, we do have
to define it properly.
> Introduce Coordination Engine
> -----------------------------
>
> Key: HADOOP-10641
> URL: https://issues.apache.org/jira/browse/HADOOP-10641
> Project: Hadoop Common
> Issue Type: New Feature
> Affects Versions: 3.0.0
> Reporter: Konstantin Shvachko
> Assignee: Plamen Jeliazkov
> Attachments: HADOOP-10641.patch, HADOOP-10641.patch,
> HADOOP-10641.patch, hadoop-coordination.patch
>
>
> Coordination Engine (CE) is a system which allows distributed processes to
> agree on a sequence of events. To be reliable, the CE should itself be
> distributed.
> A Coordination Engine can be based on different algorithms (Paxos, Raft, 2PC,
> ZAB) and have different implementations, depending on use cases, reliability,
> availability, and performance requirements.
> The CE should have a common API, so that it can serve as a pluggable component
> in different projects. The immediate beneficiaries are HDFS (HDFS-6469) and
> HBase (HBASE-10909).
> The first implementation is proposed to be based on ZooKeeper.