All,

We are starting to design and develop a framework in Python that will 
better automate different pieces of Nutch and Hadoop administration and 
deployment and we wanted to get community feedback.

We first want to replace the DFS and MapReduce startup scripts with 
Python alternatives.  All of the features below are assumed to be 
executed from a central location (for example the namenode).  These 
scripts would allow expanded functionality that would include:

1) Start, stop, restart of individual DFS and MapReduce nodes.  (This 
would be able to start and stop the namenode and jobtracker as well but 
would first check to see if data/task nodes were running and take 
appropriate action.)
2) Start, stop, restart the DFS or MapReduce cluster independently.
3) Start, stop, restart the entire DFS and MapReduce cluster.
4) Allow for individual data/job nodes to have different deployment 
locations.  This is necessary for a heterogeneous OS cluster.
5) Allow cross platform and heterogeneous OS clusters.
6) Get detailed status of individual nodes or all nodes.  This would 
include items such as disk space, cpu usage, etc.
7) Reboot or shutdown machines.  Again this would take into account 
running services.
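
As a rough illustration of the check described in item 1, the sketch 
below plans which daemons to stop, and in what order, before a master 
goes down.  The plan_stop helper, the DEPENDENTS table and the cluster 
layout are hypothetical names made up for this example, and it assumes 
one daemon role per host.

```python
# Hypothetical sketch: decide which daemons must be stopped, and in what
# order, before the daemon on a given host is stopped.
# Simplifying assumption: each host runs exactly one role.

# Each master daemon and the worker role that depends on it.
DEPENDENTS = {
    "namenode": "datanode",
    "jobtracker": "tasktracker",
}


def plan_stop(target_host, cluster):
    """Return an ordered list of (host, role) pairs to stop.

    `cluster` maps host name -> {"role": str, "running": bool}.
    Running workers that depend on a master come before the master.
    """
    target_role = cluster[target_host]["role"]
    plan = []
    worker_role = DEPENDENTS.get(target_role)
    if worker_role is not None:
        for host in sorted(cluster):
            info = cluster[host]
            if info["role"] == worker_role and info["running"]:
                plan.append((host, worker_role))
    # Only stop the target itself if it is actually running.
    if cluster[target_host]["running"]:
        plan.append((target_host, target_role))
    return plan


if __name__ == "__main__":
    example = {
        "master": {"role": "namenode", "running": True},
        "node1": {"role": "datanode", "running": True},
        "node2": {"role": "datanode", "running": False},
    }
    for host, role in plan_stop("master", example):
        print("stop %s on %s" % (role, host))
```

A restart could reuse the same plan: stop in this order, then start in 
the reverse order.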

Next we would like to split the tools for Nutch (such as crawl, 
invertlinks, etc.) and the tools for Hadoop (dfs, job, etc.) into their 
own individual Python scripts that would allow the following 
functionality:

1) Dynamic configuration of variables or resetting of the config 
directory.  This might need to be enhanced with changes to the 
configuration classes in Hadoop (we don't know yet).
2) Dynamically set other variables such as java heap space and log file 
directories.
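
One possible way to set such variables per run is to build the 
environment handed to the wrapper script.  This is only a sketch: the 
build_env helper is hypothetical, and it assumes the wrapper script 
reads variables named <PREFIX>_HEAPSIZE, <PREFIX>_LOG_DIR and 
<PREFIX>_CONF_DIR, which should be checked against the release in use.

```python
import os


def build_env(base_env, heap_mb=None, log_dir=None, conf_dir=None,
              prefix="NUTCH"):
    """Return a copy of base_env with per-run overrides applied.

    Assumption: the wrapper script reads <PREFIX>_HEAPSIZE (in MB),
    <PREFIX>_LOG_DIR and <PREFIX>_CONF_DIR. Verify these names against
    the script shipped with your release before relying on them.
    """
    env = dict(base_env)
    if heap_mb is not None:
        env[prefix + "_HEAPSIZE"] = str(int(heap_mb))
    if log_dir is not None:
        env[prefix + "_LOG_DIR"] = log_dir
    if conf_dir is not None:
        env[prefix + "_CONF_DIR"] = conf_dir
    return env


if __name__ == "__main__":
    env = build_env(os.environ, heap_mb=2000, log_dir="/tmp/nutch-logs")
    # The result could be passed as env= to subprocess.call([...]).
    print(env["NUTCH_HEAPSIZE"], env["NUTCH_LOG_DIR"])
```

The caller's own environment is left untouched, so each tool invocation 
can use different settings.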

We already have a script that automates a continual fetching process in 
user-defined blocks of a given number of URLs.  This script handles the 
entire process of injecting, generating, fetching, updating the db, and 
merging segments and crawl databases, then loops to do the next fetch, 
and so on until a stop command is given.
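
The shape of such a loop might look roughly like the sketch below.  It 
is not the existing script.  The crawl_loop function, the step labels, 
the stop-file convention and the choice to inject only once are all 
assumptions for illustration, and the real work is left to a 
caller-supplied run_step function.

```python
import os

# Labels for the repeated part of the cycle. They are names only, not
# exact Nutch command lines.
CYCLE_STEPS = ["generate", "fetch", "updatedb",
               "merge_segments", "merge_crawldbs"]


def crawl_loop(run_step, urls_per_block, stop_file, max_cycles=None):
    """Inject once, then repeat the fetch cycle until stop_file exists.

    run_step(name, urls_per_block) is supplied by the caller and does
    the real work. max_cycles is a safety limit, mainly for testing.
    Returns the number of completed cycles.
    """
    run_step("inject", urls_per_block)
    cycles = 0
    # The "stop command" is modelled here as the presence of a file.
    while not os.path.exists(stop_file):
        if max_cycles is not None and cycles >= max_cycles:
            break
        for name in CYCLE_STEPS:
            run_step(name, urls_per_block)
        cycles += 1
    return cycles


if __name__ == "__main__":
    def show(name, size):
        print("%s (block of %d urls)" % (name, size))

    crawl_loop(show, 1000, "/tmp/stop-crawl", max_cycles=1)
```

Checking for the stop file only between cycles means a cycle that has 
started always runs to completion.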

Next, we want to create Python scripts that will automate deployment to 
various nodes and perform maintenance tasks.  Unless otherwise stated, 
the scripts would be able to deploy to different deployment locations 
configured per machine and allow deployment to an individual machine, a 
list of machines, or all machines in the cluster.  They would also allow 
either the backing up or removal of old items.  This would include:

1) Deploy new release, all code and files.
2) Deploy all lib files.
3) Deploy all conf.
4) Deploy a single file.
5) Deploy all bin files.
6) Deploy a single plugin.
7) Deploy the Nutch job and jar files.
8) Deploy all plugins.
9) Remove all log files or archive to a given location.
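
To make the per-machine deployment locations concrete, here is a 
minimal sketch that builds one rsync command per target host.  The host 
names, paths and the rsync_commands helper are hypothetical, and the 
sketch assumes POSIX-style targets reachable with rsync over ssh, which 
would not cover every machine in a heterogeneous cluster.

```python
import posixpath

# Hypothetical per-machine deployment roots.
DEPLOY_ROOTS = {
    "node1": "/opt/nutch",
    "node2": "/export/home/nutch",
}


def rsync_commands(local_dir, subdir, hosts=None, roots=DEPLOY_ROOTS):
    """Build one rsync argument list per target host.

    hosts=None means every configured machine. Nothing is executed
    here; each list could be handed to subprocess.call().
    """
    if hosts is None:
        hosts = sorted(roots)
    commands = []
    for host in hosts:
        src = posixpath.join(local_dir, subdir) + "/"
        dest = "%s:%s/" % (host, posixpath.join(roots[host], subdir))
        commands.append(["rsync", "-az", "-e", "ssh", src, dest])
    return commands


if __name__ == "__main__":
    # Deploy the conf directory to every configured machine.
    for cmd in rsync_commands("/build/nutch", "conf"):
        print(" ".join(cmd))
```

The same function covers "one machine", "a list of machines" and "all 
machines" through the hosts argument.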

Finally, we would like to automate search index deployment and 
administration.  Unless otherwise stated, the scripts would be able to 
deploy to different target locations configured per machine and allow 
targeting an individual machine, a list of machines, or all machines 
in the cluster.  This functionality would include:

1) Configure a cluster of search servers.
2) Deploy, remove, and redeploy index pieces (parts of an index) to 
search servers.
3) Start, stop, and restart search servers.
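
How index pieces get assigned to search servers is not decided yet.  As 
one simple possibility, the sketch below spreads them round-robin.  The 
assign_parts helper and the part and server names are hypothetical.

```python
def assign_parts(parts, servers):
    """Spread index parts across search servers in round-robin order.

    Returns a dict mapping each server to its list of parts. Round-robin
    is an assumption chosen for simplicity, not a settled design.
    """
    if not servers:
        raise ValueError("at least one search server is required")
    layout = {server: [] for server in servers}
    for i, part in enumerate(parts):
        layout[servers[i % len(servers)]].append(part)
    return layout


if __name__ == "__main__":
    layout = assign_parts(["part-0", "part-1", "part-2"], ["s1", "s2"])
    for server in sorted(layout):
        print(server, layout[server])
```

Redeploying would then mean recomputing the layout and pushing only the 
parts whose server changed.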

We would have detailed help screens and fully documented scripts 
built on a common framework.  If it were designed correctly, we could 
set up job streams that did automatic crawls, re-crawls, integrations, 
indexing, and deployments to search servers, all of which would be 
needed for the ongoing operation of a web search engine.

There is one catch: this functionality would require Python to be 
installed on at least the controller node.  Everything would be pushed 
from there to the other machines, implemented in Python using pexpect 
and probably running commands through ssh, etc.
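
To give an idea of what the pexpect side could look like, here is a 
minimal sketch of running one command on a remote machine.  The 
run_remote and ssh_args helpers are hypothetical, pexpect is a 
third-party module that must be installed on the controller, and the 
sketch assumes the host key is already known and that at most one 
password prompt appears.

```python
def ssh_args(host, remote_cmd, user=None):
    """Build the argument list passed to the ssh executable."""
    target = "%s@%s" % (user, host) if user else host
    return [target, remote_cmd]


def run_remote(host, remote_cmd, password=None, user=None, timeout=30):
    """Run remote_cmd on host over ssh; return (exit_status, output).

    Assumes the host key is already accepted. A second password prompt
    (wrong password) ends in a pexpect.TIMEOUT exception.
    """
    # Third-party module; imported here so ssh_args works without it.
    import pexpect

    child = pexpect.spawn("ssh", ssh_args(host, remote_cmd, user),
                          timeout=timeout, encoding="utf-8")
    index = child.expect(["(?i)password:", pexpect.EOF])
    if index == 0:
        if password is None:
            child.close(force=True)
            raise RuntimeError("password required for %s" % host)
        child.sendline(password)
        child.expect(pexpect.EOF)
    output = child.before  # everything printed before EOF
    child.close()          # sets child.exitstatus
    return child.exitstatus, output
```

With key-based ssh logins the password branch is never taken, which 
would likely be the more practical setup for a whole cluster.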

If you have thoughts on this please let us know and we will see if we 
can integrate the requests into the development.

Dennis Kubes

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
