Dennis Kubes wrote:
> All,
>
> We are starting to design and develop a framework in Python that will 
> better automate different pieces of Nutch and Hadoop administration 
> and deployment and we wanted to get community feedback.
>
> We first want to replace the DFS and MapReduce startup scripts with 
> python alternatives.  All of the features below are assumed to be 
> executed from a central location (for example the namenode).  These 
> scripts would allow expanded functionality that would include:
>
> 1) Start, stop, restart of individual DFS and MapReduce nodes.  (This 
> would be able to start and stop the namenode and jobtracker as well 
> but would first check to see if data/task nodes were running and take 
> appropriate action.)
> 2) Start, stop, restart dfs or map reduce cluster independently.
> 3) Start, stop, restart the entire dfs and map reduce cluster.
> 4) Allow for individual data/job nodes to have different deployment 
> locations.  This is necessary for a heterogeneous OS cluster.
> 5) Allow cross platform and heterogeneous OS clusters.
> 6) Get detailed status of individual nodes or all nodes.  This would 
> include items such as disk space, cpu usage, etc.
> 7) Reboot or shutdown machines.  Again this would take into account 
> running services.
>
Well, I think a heterogeneous OS cluster might not be a good idea. 
Managing both Linux and Windows in one script might be tricky.
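On the single-node start/stop side, the core could be something like the sketch below. It assumes passwordless ssh from the controller node and Hadoop's stock bin/hadoop-daemon.sh on each machine; the host names and install paths are made up to show the per-node deployment locations from item 4:

```python
import subprocess

# Hypothetical per-node install locations (item 4 above); in practice this
# would come from a config file on the controller node.
NODES = {
    "data1": "/opt/hadoop",
    "data2": "/usr/local/hadoop",
}

def remote_command(host, hadoop_home, daemon, action):
    """Build the ssh command line that starts or stops one daemon
    (e.g. datanode, tasktracker) via Hadoop's own daemon script."""
    script = "%s/bin/hadoop-daemon.sh" % hadoop_home
    return ["ssh", host, script, action, daemon]

def control_node(host, daemon, action):
    """Run the command against one node; returns the ssh exit code."""
    return subprocess.call(remote_command(host, NODES[host], daemon, action))
```

Stopping the namenode safely (the "check data/task nodes first" behavior in item 1) would then just be a loop over NODES calling control_node with "stop" before touching the master.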

> Next we would like to split the tools for nutch (such as crawl, 
> invertlinks, etc.) and the tools for hadoop (dfs, job, etc.) into 
> their own individual python scripts that would allow the following 
> functionality.
>
> 1) Dynamic configuration of variables or resetting of config 
> directory.  This might need to be enhanced with changes to the 
> configuration classes in Hadoop (don't know yet).
> 2) Dynamically set other variables such as java heap space and log 
> file directories.
You don't have to change the Configuration classes. Each runnable class in 
Nutch (except crawl) extends ToolBase, which accepts a -conf <conf_dir> 
argument.
Java heap space is configurable from bin/nutch, so a little modification 
will work. NUTCH_LOG_DIR is read from the environment in bin/nutch.

>
> We already have a script that automates a continual fetching process 
> in user defined blocks of number of urls.  This script handles the 
> entire process of injecting, generating, fetching, updating db, 
> merging segments and crawl databases and looping and doing the next 
> fetch and so on until a stop command is given.
>
> Next, we want to create python script that will automate deployment to 
> various nodes and perform maintenance tasks.  Unless otherwise stated 
> the scripts would be able to deploy to different deployment locations 
> configured per machine and allow deployment to an individual machine, 
> a list of machines, or all machines in the cluster.  It would also 
> allow either the backing up or removal of old items.  This would include:
>
> 1) Deploy new release, all code and files.
> 2) Deploy all lib files.
> 3) Deploy all conf.
> 4) Deploy a single file.
> 5) Deploy all bin files.
> 6) Deploy a single plugin.
> 7) Deploy the Nutch job and jar files.
> 8) Deploy all plugins
> 9) Remove all log files or archive to a given location.
Well, unfortunately, there are lots of issues in updating the current 
codebase. Lots of manual testing has to be done: sometimes a new 
feature includes bugs, and previously deployed files become incompatible. 
In my experience, automating such tasks is not straightforward.
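Still, the mechanical part of the deploy (items 1-8, with the per-machine locations and the backup-before-overwrite behavior) is easy to sketch; the host names and paths below are made up, and rsync over ssh stands in for whatever transport is finally chosen:

```python
import subprocess

# Hypothetical per-machine deployment roots (the heterogeneous layout
# described above); this would come from the central config.
DEPLOY_ROOTS = {
    "node1": "/opt/nutch",
    "node2": "/home/nutch/deploy",
}

def deploy_command(host, local_path, subdir):
    """rsync one file or directory (a plugin, conf/, a single jar) into
    that host's deployment root, backing up anything it overwrites."""
    dest = "%s:%s/%s" % (host, DEPLOY_ROOTS[host], subdir)
    return ["rsync", "-az", "--backup", local_path, dest]

def deploy(hosts, local_path, subdir):
    """Push to one machine, a list of machines, or all machines."""
    for host in hosts:
        subprocess.call(deploy_command(host, local_path, subdir))
```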

>
> Finally we would like to automate search index deployment and 
> administration.   Unless otherwise stated the scripts would be able to 
> deploy to target different locations configured per machine and allow 
> targeting to an individual machine, a list of machines, or all 
> machines in the cluster. This functionality would include:
>
> 1) Configure a cluster of search servers.
> 2) Deploy, remove, and redeploy index pieces (parts of an index) to 
> search servers.
> 3) Start, Stop, and restart search servers.
Well, this can be handy. I have written a script which uses 
start-stop-daemon to start and stop the index servers as background 
processes. The script also checks the status of the server. But what is 
really needed in Nutch is dynamically adding and removing a bunch of 
index servers without affecting the front-end.
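The status-check part of such a script could be as simple as a TCP connect to the search server's port; this is only a liveness probe, not a real search query, and the host/port would come from the cluster config:

```python
import socket

def server_alive(host, port, timeout=5.0):
    """Return True if a search server is accepting connections on its
    port; a plain TCP connect, cheap enough to run from a status loop."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except socket.error:
        return False
```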

>
> We would have detailed help screens and fully documented scripts 
> using a common framework of scripts.  If it was designed correctly we 
> could setup job streams that did automatic crawls, re-crawls, 
> integrations, indexing, and deployments to search servers.  All of 
> which would be needed for the ongoing operation of a web search engine.
>
> There is a catch and that is that this functionality would require 
> python to be installed on at least the controller node.
> This would be a push to the machines and it would be implemented in 
> python using pexpect, probably executing commands through ssh, etc.
>
I think the Python dependency won't be a problem if the management script 
is good enough.
A web interface along with the Python interface would be great, I think.
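The pexpect side of the push could be as small as the sketch below. The command builder is plain string assembly; pexpect is imported lazily inside the runner so the rest stays usable without the dependency, and the password handling assumes the usual "password:" prompt when ssh keys are not set up:

```python
def ssh_command(host, command):
    """The exact command line handed to pexpect.spawn."""
    return "ssh %s %s" % (host, command)

def ssh_run(host, command, password=None, timeout=60):
    """Push one command to a machine from the controller node and return
    its output; answers an ssh password prompt if one appears."""
    import pexpect  # the proposal's stated dependency
    child = pexpect.spawn(ssh_command(host, command), timeout=timeout)
    if password is not None:
        if child.expect(["[Pp]assword:", pexpect.EOF]) == 0:
            child.sendline(password)
            child.expect(pexpect.EOF)
    else:
        child.expect(pexpect.EOF)
    return child.before
```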

> If you have thoughts on this please let us know and we will see if we 
> can integrate the requests into the development.
>
> Dennis Kubes
>
Good luck with that, I will be looking forward to seeing it.




_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
