Dennis Kubes wrote:
All,

We are starting to design and develop a framework in Python that will better automate different pieces of Nutch and Hadoop administration and deployment and we wanted to get community feedback.

We first want to replace the DFS and MapReduce startup scripts with Python alternatives. All of the features below are assumed to be executed from a central location (for example, the namenode). These scripts would allow expanded functionality that would include the following (a rough sketch follows the list):

1) Start, stop, restart of individual DFS and MapReduce nodes. (This would be able to start and stop the namenode and jobtracker as well but would first check to see if data/task nodes were running and take appropriate action.)
2) Start, stop, restart the DFS or MapReduce cluster independently.
3) Start, stop, restart the entire DFS and MapReduce cluster.
4) Allow individual data/task nodes to have different deployment locations. This is necessary for a heterogeneous-OS cluster.
5) Allow cross-platform and heterogeneous-OS clusters.
6) Get detailed status of individual nodes or all nodes. This would include items such as disk space, CPU usage, etc.
7) Reboot or shut down machines. Again, this would take into account running services.
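
As a rough sketch of what items 1 through 4 could look like from the central node, assuming the stock hadoop-daemon.sh is present at each deploy location (hostnames, paths, and the per-node map below are hypothetical):

    import subprocess

    # Hypothetical per-node deploy locations (item 4); each node may use
    # a different path, which a heterogeneous cluster requires.
    NODES = {
        "node1.example.com": "/opt/hadoop",
        "node2.example.com": "/usr/local/hadoop",
    }

    def daemon(host, deploy_dir, action, service):
        """Run hadoop-daemon.sh on a remote node over ssh.
        action is start or stop; service is e.g. datanode or tasktracker."""
        remote = "%s/bin/hadoop-daemon.sh %s %s" % (deploy_dir, action, service)
        return subprocess.call(["ssh", host, remote])

    def restart(service, hosts=None):
        # Items 1-3: act on one node, a list of nodes, or the whole cluster.
        for host in hosts or NODES:
            daemon(host, NODES[host], "stop", service)
            daemon(host, NODES[host], "start", service)

Calling restart("datanode", ["node1.example.com"]) would then bounce a single node, while restart("datanode") would walk the whole cluster.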

Well, I think a heterogeneous-OS cluster might not be a good idea. Managing both Linux and Windows in one script might be tricky.

Next we would like to split the tools for Nutch (such as crawl, invertlinks, etc.) and the tools for Hadoop (dfs, job, etc.) into their own individual Python scripts that would allow the following functionality.

1) Dynamic configuration of variables or resetting of the config directory. This might need to be enhanced with changes to the configuration classes in Hadoop (don't know yet).
2) Dynamically set other variables such as Java heap space and log file directories.
You don't have to change the Configuration classes. Each runnable class in Nutch (except crawl) extends ToolBase, which allows a -conf <conf_dir> argument. Java heap space is configurable from bin/nutch, so a little modification will work. NUTCH_LOG_DIR is read from the environment in bin/nutch.
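
Building on that, a minimal Python wrapper could pass -conf and set NUTCH_LOG_DIR without touching the Configuration classes; the helper name and paths here are made up for illustration:

    import os
    import subprocess

    def run_nutch_tool(tool, conf_dir, log_dir, extra_args=()):
        """Invoke a tool through bin/nutch, handing it the -conf argument
        that ToolBase accepts and setting NUTCH_LOG_DIR, which bin/nutch
        reads from the environment."""
        env = dict(os.environ)
        env["NUTCH_LOG_DIR"] = log_dir
        cmd = ["bin/nutch", tool, "-conf", conf_dir] + list(extra_args)
        return subprocess.call(cmd, env=env)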


We already have a script that automates a continual fetching process in user-defined blocks of URLs. This script handles the entire process of injecting, generating, fetching, updating the db, merging segments and crawl databases, then looping and doing the next fetch, and so on until a stop command is given.
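
For reference, a bare-bones sketch of the generate/fetch/updatedb core of such a loop follows; injection and segment merging are omitted, the stop-file convention is an assumption, and it reads the segments directory from the local filesystem for brevity:

    import os
    import subprocess

    def sh(args):
        # Abort the loop if any Nutch command fails.
        if subprocess.call(args) != 0:
            raise RuntimeError("command failed: %s" % " ".join(args))

    def newest_segment(segments_dir):
        # Segments are named by timestamp, so the lexicographically
        # greatest entry is the most recent one.
        return os.path.join(segments_dir, sorted(os.listdir(segments_dir))[-1])

    def fetch_loop(crawldb, segments_dir, topn, stop_file="stop.now"):
        while not os.path.exists(stop_file):
            sh(["bin/nutch", "generate", crawldb, segments_dir, "-topN", str(topn)])
            seg = newest_segment(segments_dir)
            sh(["bin/nutch", "fetch", seg])
            sh(["bin/nutch", "updatedb", crawldb, seg])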

Next, we want to create Python scripts that will automate deployment to various nodes and perform maintenance tasks. Unless otherwise stated, the scripts would be able to deploy to different deployment locations configured per machine and allow deployment to an individual machine, a list of machines, or all machines in the cluster. They would also allow either backing up or removing old items. This would include the following (a sketch follows the list):

1) Deploy new release, all code and files.
2) Deploy all lib files.
3) Deploy all conf files.
4) Deploy a single file.
5) Deploy all bin files.
6) Deploy a single plugin.
7) Deploy the Nutch job and jar files.
8) Deploy all plugins.
9) Remove all log files or archive to a given location.
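
As a hedged sketch of the single-file case (item 4), with per-machine deploy locations and optional backup; hostnames and paths are invented:

    import subprocess
    import time

    # Hypothetical per-machine deployment locations.
    DEPLOY = {
        "node1.example.com": "/opt/nutch",
        "node2.example.com": "/usr/local/nutch",
    }

    def deploy_file(relpath, hosts=None, backup=True):
        """Copy one file to an individual machine, a list of machines,
        or (by default) all machines, optionally backing up the old copy."""
        for host in hosts or DEPLOY:
            dest = "%s/%s" % (DEPLOY[host], relpath)
            if backup:
                stamp = time.strftime("%Y%m%d%H%M%S")
                subprocess.call(["ssh", host, "cp %s %s.%s" % (dest, dest, stamp)])
            subprocess.call(["scp", relpath, "%s:%s" % (host, dest)])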
Well, unfortunately, there are lots of issues in updating the current codebase. Lots of manual testing has to be done, sometimes a new feature includes bugs, and previous files become incompatible. In my experience, automating such tasks is not straightforward.


Finally, we would like to automate search index deployment and administration. Unless otherwise stated, the scripts would be able to target different deployment locations configured per machine and allow targeting an individual machine, a list of machines, or all machines in the cluster. This functionality would include the following (a sketch follows the list):

1) Configure a cluster of search servers.
2) Deploy, remove, and redeploy index pieces (parts of an index) to search servers.
3) Start, stop, and restart search servers.
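
A sketch of item 3, assuming the stock bin/nutch server <port> <index dir> distributed-search command; hostnames and index paths are invented:

    import subprocess

    # Hypothetical layout: host -> (port, index piece served by that host).
    SEARCHERS = {
        "search1.example.com": (9999, "/data/index/part-0"),
        "search2.example.com": (9999, "/data/index/part-1"),
    }

    def start_searchers(hosts=None):
        for host in hosts or SEARCHERS:
            port, index = SEARCHERS[host]
            # nohup keeps the server alive after the ssh session exits.
            remote = "nohup bin/nutch server %d %s >/dev/null 2>&1 &" % (port, index)
            subprocess.call(["ssh", host, remote])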
Well, this can be handy. I have written a script which uses start-stop-daemon to start and stop the index servers as a background process. The script also checks the status of the server. But what is really needed in Nutch is dynamically adding and removing a bunch of index servers without affecting the front-end.


We would have detailed help screens and fully documented scripts using a common scripting framework. If it was designed correctly, we could set up job streams that did automatic crawls, re-crawls, integrations, indexing, and deployments to search servers, all of which would be needed for the ongoing operation of a web search engine.

There is a catch: this functionality would require Python to be installed on at least the controller node. Commands would be pushed from there to the machines; it would be implemented in Python using pexpect, probably issuing commands through ssh, etc.
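
For instance, a minimal pexpect-based ssh runner might look like this; the prompt pattern and password handling are simplified, and key-based ssh would avoid the prompt entirely:

    import pexpect

    def ssh_command(host, command, password, timeout=60):
        """Run one command over ssh, answering a password prompt if it
        appears, and return the remote output."""
        child = pexpect.spawn("ssh %s %s" % (host, command), timeout=timeout)
        i = child.expect(["password:", pexpect.EOF])
        if i == 0:
            child.sendline(password)
            child.expect(pexpect.EOF)
        return child.before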

I think the Python dependency won't be a problem if the management script is good enough.
A web interface along with the Python interface would be great, I think.

If you have thoughts on this please let us know and we will see if we can integrate the requests into the development.

Dennis Kubes

Good luck with that, I will be looking forward to seeing it.


