[
https://issues.apache.org/jira/browse/HADOOP-3628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12624383#action_12624383
]
Steve Loughran commented on HADOOP-3628:
----------------------------------------
Konstantin - glad you like the slides. We could do a phone conference and
present them to you, ideally with a demo of what I have working.
ping() is interesting because it's a way of checking, even remotely, that a
system is healthy. A good check here verifies that all the composite parts of
the system are happy, and if so, it's happy too. That means being able to
aggregate the health of other things is handy. But ping() can do more than
just report the current health/state; it can do some extra work that checks
state more thoroughly. For example, if a health requirement is that a specific
directory must be writeable and its clock must be close to that of the host
CPU, a good test is: create a file of a few bytes and check its timestamp. At
the same time, it's good to have a fast test without major side effects. If
ping() created a 20MB file then it would be slow and other things could
suffer. Fast tests are good.
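As a rough sketch of the directory check described above (not code from the
patch): the class name DirectoryPing, the maxSkewMillis parameter and the
choice to signal failure with an IOException are all assumptions made for
illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical health check; names are illustrative, not taken from the patch. */
final class DirectoryPing {
    private final Path dir;
    private final long maxSkewMillis; // assumed tolerance between file time and host clock

    DirectoryPing(Path dir, long maxSkewMillis) {
        this.dir = dir;
        this.maxSkewMillis = maxSkewMillis;
    }

    /** Throws IOException if the directory is not writeable or its clock has drifted. */
    void ping() throws IOException {
        // Creating the probe file proves the directory is writeable.
        Path probe = Files.createTempFile(dir, "ping", ".tmp");
        try {
            // A few bytes only, so the check stays fast.
            Files.write(probe, new byte[] {1, 2, 3, 4});
            long fileTime = Files.getLastModifiedTime(probe).toMillis();
            long skew = Math.abs(System.currentTimeMillis() - fileTime);
            if (skew > maxSkewMillis) {
                throw new IOException("Clock skew of " + skew + "ms on " + dir);
            }
        } finally {
            // Remove the probe so the check leaves no side effects behind.
            Files.deleteIfExists(probe);
        }
    }
}
```

The tolerance would need to allow for the timestamp granularity of the
filesystem in question, which can be a second or more.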
Some goals of the design are
-easy to subclass. I'm experimenting with an extended name node that fails a
ping() when fewer than a minimum number of workers are attached.
-easy to aggregate. A datanode is live if the filesystem.ping() is happy; a
cluster could be defined as live if 3 out of 5 datanodes were up at any point
in time.
-let us extend with more than just the main services. I have things that I ping
to hit web pages and expect a response in a specific range, or check for files
existing, without needing to create new threads for everything that I check.
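The aggregation goal could look something like the sketch below. The Pingable
interface and the QuorumPing class are hypothetical names invented here, and
treating any IOException from a child as "down" is an assumption, not
something the patch is known to do.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical interface: anything that reports bad health by throwing. */
interface Pingable {
    void ping() throws IOException;
}

/** Illustrative aggregate: healthy when at least `quorum` children answer a ping. */
final class QuorumPing implements Pingable {
    private final List<Pingable> children;
    private final int quorum;

    QuorumPing(List<Pingable> children, int quorum) {
        this.children = children;
        this.quorum = quorum;
    }

    /** Pings every child in the calling thread and counts those that succeed. */
    int liveCount() {
        int live = 0;
        for (Pingable child : children) {
            try {
                child.ping();
                live++;
            } catch (IOException e) {
                // A failed ping just means this child is counted as down.
            }
        }
        return live;
    }

    @Override
    public void ping() throws IOException {
        int live = liveCount();
        if (live < quorum) {
            throw new IOException("Only " + live + " of " + children.size()
                    + " children are live; need " + quorum);
        }
    }
}
```

With five datanode checks and a quorum of 3, ping() on the aggregate succeeds
while any three of them respond. Because QuorumPing is itself a Pingable,
aggregates can be nested, and no extra threads are needed.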
The current split between getServiceState() and ping() has the
getServiceState() call reporting the state at the last time it was changed,
while ping() can actually do some health checking. And it doesn't need another
thread if you don't want to do work in the background.
>Sometimes it is not clear what is the difference between INITIALIZED, STARTED,
>and LIVE.
>Or why does it matter whether the server was TERMINATED or FAILED.
I will try to document the design. It's based on some of the stuff I've done
in SmartFrog and in some of the grid standards bodies, where I learned to avoid
any complex DMTF-compatible state model, as it got way too complex fast.
INITIALIZED: you've read your stuff in, not started any threads or anything.
This is done in init(), and is a good place for subclasses to do tricks before
or after super.innerInit() is called.
STARTED: you've been told to start, and once you are ready, you will go live.
LIVE: ready to do useful work. Something may still go wrong on a request, but,
well, that's the only way to be 100% sure that anything is healthy.
FAILED: Something went wrong, you have no intention of ever working again, but
are still instantiated.
TERMINATED: at this point you are about to be deleted; all references to you
removed. It's the end point for the state graph.
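To make the states a little more concrete, here is a minimal sketch of how
they could fit together. It is not the code in the attached patch: the
LifecycleSketch class, the CREATED starting state, innerStart(), innerPing()
and terminate() are assumptions made for illustration; only init(),
innerInit(), ping(), getServiceState() and the five state names come from the
discussion above. MinWorkersService is a hypothetical subclass whose ping()
fails while too few workers are attached.

```java
import java.io.IOException;

/** CREATED is an assumed pre-init state; the other five are the states described above. */
enum ServiceState { CREATED, INITIALIZED, STARTED, LIVE, FAILED, TERMINATED }

abstract class LifecycleSketch {
    private ServiceState state = ServiceState.CREATED;

    /** Reports the state recorded at the last transition; does no health checking. */
    final ServiceState getServiceState() {
        return state;
    }

    /** Read configuration only; no threads are started here. */
    final void init() throws IOException {
        require(ServiceState.CREATED);
        try {
            innerInit();
            state = ServiceState.INITIALIZED;
        } catch (IOException | RuntimeException e) {
            state = ServiceState.FAILED;
            throw e;
        }
    }

    /** Enters STARTED, then goes LIVE once innerStart() has completed. */
    final void start() throws IOException {
        require(ServiceState.INITIALIZED);
        state = ServiceState.STARTED;
        try {
            innerStart();
            state = ServiceState.LIVE;
        } catch (IOException | RuntimeException e) {
            state = ServiceState.FAILED;
            throw e;
        }
    }

    /** Active health check: fails unless the component is LIVE and innerPing() passes. */
    final void ping() throws IOException {
        if (state != ServiceState.LIVE) {
            throw new IOException("Not live: " + state);
        }
        innerPing();
    }

    /** End point of the state graph. */
    final void terminate() {
        state = ServiceState.TERMINATED;
    }

    // Hooks for subclasses; the defaults do nothing.
    protected void innerInit() throws IOException { }
    protected void innerStart() throws IOException { }
    protected void innerPing() throws IOException { }

    private void require(ServiceState expected) {
        if (state != expected) {
            throw new IllegalStateException("Expected " + expected + " but was " + state);
        }
    }
}

/** Hypothetical subclass: unhealthy until a minimum number of workers has attached. */
final class MinWorkersService extends LifecycleSketch {
    private final int minWorkers;
    private int workers;

    MinWorkersService(int minWorkers) {
        this.minWorkers = minWorkers;
    }

    void attachWorker() {
        workers++;
    }

    @Override
    protected void innerPing() throws IOException {
        if (workers < minWorkers) {
            throw new IOException(workers + " workers attached; need " + minWorkers);
        }
    }
}
```

In this sketch a failed ping() does not move the component to FAILED, on the
assumption that a bad health check may be transient while FAILED is permanent.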
> Add a lifecycle interface for Hadoop components: namenodes, job clients, etc.
> -----------------------------------------------------------------------------
>
> Key: HADOOP-3628
> URL: https://issues.apache.org/jira/browse/HADOOP-3628
> Project: Hadoop Core
> Issue Type: Improvement
> Components: dfs, mapred
> Affects Versions: 0.19.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: AbstractHadoopComponent.java, hadoop-3628.patch,
> hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch,
> hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch, hadoop-3628.patch
>
>
> I'd like to propose we have a standard interface for hadoop components, the
> things that get started or stopped when you bring up a namenode. Currently,
> some of these classes have a stop() or shutdown() method, with no standard
> name/interface, and no way of seeing if they are live, checking their health
> or shutting them down reliably. Indeed, there is a tendency for the spawned
> threads to not want to die; to require the entire process to be killed to
> stop the workers.
> Having a standard interface would make it easier for
> * management tools to manage the different things
> * monitoring the state of things
> * subclassing
> The latter is interesting as right now TaskTracker and JobTracker start up
> threads in their constructor; that's very dangerous as subclasses may have
> their methods called before they are fully initialised. Adding this interface
> would be the right time to clean up the startup process so that subclassing
> is less risky.