[ 
https://issues.apache.org/jira/browse/HADOOP-10641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14108989#comment-14108989
 ] 

Steve Loughran commented on HADOOP-10641:
-----------------------------------------

Thank your for the really good TLA document —I do think it makes what's going 
on a lot clearer as now others can see strictly what implementations do. (It 
now places a requirement to me to do the same for my proposals, but I'm happy 
with that). 

It also places a requirement for me to look at the code alongside the spec, so 
here goes:

h3. Algorithm

# presumably usedIds aren't collected forever in common implementations, 
instead they use enough of a time marker in their IDs to be self-windowing.

h3. core code

# {{CoordinationEngine}} should extend Hadoop common's {{AbstractService}}; 
this makes it trivial to integrate with the lifecycle of YARN apps and when 
someone migrates the NN/DN to the lifecycle will hook to HDFS the same way. The 
init/start/stop operations all match. 
# {{ZKConfigKeys}} needs a name more tied to the coordination engine.
# {{ZKCoordinationEngine}} could use guava {{Precondition}} as a check on 
{{localNodeId}} and move it up to the init method.
# I don't like downgrading all ZK exceptions to "IOE", as it hides things like 
security exceptions, missing parents &c. Again this is something we could do 
that is more generic than just for the co-ord engine, as I've ended up doing 
some of this in my code.

# {{ZKCoordinationEngine.loopLearningUntilCaughtUp()}} 
# {{createOrGetGlobalSequenceNumber}}'s  {{while(true)}} loop appears to spin 
forever if the exception raised in its ZK actions is 
{{KeeperException.NoAuthException}}... that is, if starting on a secure cluster 
where it can't access the path. More filtering of exception types is needed 
with the unrecoverables thrown up ({{SessionExpiredException}} +maybe some 
others). 

# {{ZKCoordinationEngine.submitProposal()}}: needs better exception text, 
ideally including path and text of nested exception.
# I'm not sure I like the way {{ZKCoordinationEngine.processImpl()}} shuts 
itself down. I'd prefer some exception for the caller to process, so that owner 
of the engine is in sync.
# if an {{AgreementsThread}} fails its exception doesn't get picked up ... 
these need to be propagated back to the junit thread. Ideall also logged in 
that {{AgreementsThread}}  after {{Thread.currentThread().setName()}} has given 
it a name for the log statements. 
  
h3. tests
  
# MiniZKCluster ... I have the one from Twill converted to a YARN svc; this is 
is one i'd like to see this switch to *once the code is checked in*. It's 
lighter weight than the HBase one.
# there's always a delay for ZK startup. Could you make the test cases start 
one as a @BeforeClass and all share the same one?
# needs tests for failure conditions: no ZK cluster
# if there's a way to do this, tests for against a secure cluster, both 
succeeding and failing.
# add a test timeout via {{@Rule public final Timeout testTimeout = new 
Timeout(30000);}}
  
  
h3. minor
    
# minor: needs formatting to style, use of {} in all conditional clauses.
# Can you use SLF4K everywhere -it's more efficient as can do string expansion 
only when needed & will stop people complaining that every log action needs to 
be wrapped by log-level checks. Then switch calls like {{LOG.info("Got watched 
event: " + watchedEvent)}} to {{LOG.info("Got watched event: {}", 
watchedEvent)}}



h2. What now?


I'm pretty happy with it —though more reviewers are needed. What now? 
# We commit {{hadoop-coordination}} with the implementation into Hadoop common, 
and work towards getting it into a version of the NN which can use it for 
committing operations; maybe later in YARN and downstream.
# We work with the curator team to get it into curator and pull that into 
server-side Hadoop. YARN-913 is going to need that in the RM anyway, though so 
far HDFS hasn't. Strengths: lives with the rest of their ZK work. Weaknesses: 
looser coupling to Hadoop & a different release cycle. 
# it goes into a hadoop-zookeeper module that depends on ZK & Curator, and on 
which things which need these can depend on, including client-side code that 
wants it (ZK trickles out somehow already, pulling it apart would be cleaner). 
For example, my mini-ZK cluster as YARN service could be anothe service to add.

Strategy 3 appeals to me : it's not that different from what is there today 
(indeed, we just fix the module name for now & add more features as/when needed)


> Introduce Coordination Engine interface
> ---------------------------------------
>
>                 Key: HADOOP-10641
>                 URL: https://issues.apache.org/jira/browse/HADOOP-10641
>             Project: Hadoop Common
>          Issue Type: New Feature
>    Affects Versions: 3.0.0
>            Reporter: Konstantin Shvachko
>            Assignee: Plamen Jeliazkov
>         Attachments: HADOOP-10641.patch, HADOOP-10641.patch, 
> HADOOP-10641.patch, HADOOP-10641.patch, HADOOP-10641.patch, 
> NNThroughputBenchmark Results.pdf, ce-tla.zip, hadoop-coordination.patch, 
> zkCEBenchmark.pdf, zkbench.pdf
>
>
> Coordination Engine (CE) is a system, which allows to agree on a sequence of 
> events in a distributed system. In order to be reliable CE should be 
> distributed by itself.
> Coordination Engine can be based on different algorithms (paxos, raft, 2PC, 
> zab) and have different implementations, depending on use cases, reliability, 
> availability, and performance requirements.
> CE should have a common API, so that it could serve as a pluggable component 
> in different projects. The immediate beneficiaries are HDFS (HDFS-6469) and 
> HBase (HBASE-10909).
> First implementation is proposed to be based on ZooKeeper.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to