Dennis Kubes wrote:
> All,
>
> We are starting to design and develop a framework in Python that will
> better automate different pieces of Nutch and Hadoop administration
> and deployment, and we wanted to get community feedback.
>
> We first want to replace the DFS and MapReduce startup scripts with
> Python alternatives. All of the features below are assumed to be
> executed from a central location (for example the namenode). These
> scripts would allow expanded functionality that would include:
>
> 1) Start, stop, and restart of individual DFS and MapReduce nodes.
> (This would be able to start and stop the namenode and jobtracker as
> well, but would first check whether data/task nodes were running and
> take appropriate action.)
> 2) Start, stop, and restart the DFS or MapReduce cluster independently.
> 3) Start, stop, and restart the entire DFS and MapReduce cluster.
> 4) Allow individual data/task nodes to have different deployment
> locations. This is necessary for a heterogeneous OS cluster.
> 5) Allow cross-platform and heterogeneous OS clusters.
> 6) Get detailed status of individual nodes or all nodes. This would
> include items such as disk space, CPU usage, etc.
> 7) Reboot or shut down machines. Again, this would take running
> services into account.

Well, I think a heterogeneous OS cluster might not be a good idea. Managing both Linux and Windows nodes in one script could be tricky.
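The per-node control in points 1 and 4 above could be sketched roughly as follows. This is a minimal sketch, not the proposed implementation: the host names, per-node install paths, and the use of `hadoop-daemon.sh` over plain ssh are all assumptions (the proposal itself suggests pexpect for the remote-command layer).

```python
import subprocess

# Hypothetical per-node deployment locations (point 4): each node may
# have Hadoop installed in a different directory.
NODES = {
    "node1.example.com": "/opt/hadoop",
    "node2.example.com": "/home/hadoop/hadoop",
}

def daemon_command(host, action, daemon):
    """Build the remote daemon command for a node, using that node's
    own deployment location."""
    home = NODES[host]
    return "%s/bin/hadoop-daemon.sh %s %s" % (home, action, daemon)

def run_on(host, command):
    """Run a command on a remote host over ssh; returns the exit code.
    (A real script might use pexpect instead, as the proposal suggests.)"""
    return subprocess.call(["ssh", host, command])

def datanodes_running():
    """Point 1's pre-check: is any datanode process still up?"""
    return any(run_on(host, "pgrep -f DataNode") == 0 for host in NODES)

def stop_namenode(namenode_host):
    """Refuse to stop the namenode while datanodes are still running."""
    if datanodes_running():
        raise RuntimeError("datanodes still running; stop them first")
    return run_on(namenode_host, daemon_command(namenode_host, "stop", "namenode"))
```

The per-host lookup in `daemon_command` is what makes the heterogeneous deployment locations of point 4 workable from one central script.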
> Next we would like to split the tools for Nutch (such as crawl,
> invertlinks, etc.) and the tools for Hadoop (dfs, job, etc.) into
> their own individual Python scripts that would allow the following
> functionality:
>
> 1) Dynamic configuration of variables or resetting of the config
> directory. This might need to be enhanced with changes to the
> configuration classes in Hadoop (don't know yet).
> 2) Dynamically set other variables such as Java heap space and log
> file directories.

You don't have to change the Configuration classes. Each runnable class in Nutch (except crawl) extends ToolBase, which accepts a -conf <conf_dir> argument. Java heap space is configurable from bin/nutch, so a small modification will work. NUTCH_LOG_DIR is read from the environment in bin/nutch.

> We already have a script that automates a continual fetching process
> in user-defined blocks of numbers of URLs. This script handles the
> entire process of injecting, generating, fetching, updating the db,
> merging segments and crawl databases, then looping and doing the
> next fetch, and so on until a stop command is given.
>
> Next, we want to create a Python script that will automate deployment
> to various nodes and perform maintenance tasks. Unless otherwise
> stated, the scripts would be able to deploy to different deployment
> locations configured per machine and allow deployment to an
> individual machine, a list of machines, or all machines in the
> cluster. It would also allow either the backing up or removal of old
> items. This would include:
>
> 1) Deploy a new release, all code and files.
> 2) Deploy all lib files.
> 3) Deploy all conf files.
> 4) Deploy a single file.
> 5) Deploy all bin files.
> 6) Deploy a single plugin.
> 7) Deploy the Nutch job and jar files.
> 8) Deploy all plugins.
> 9) Remove all log files or archive them to a given location.

Well, unfortunately, there are lots of issues in updating the current codebase. A lot of manual testing has to be done, and sometimes a new feature introduces bugs.
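The -conf approach above fits the continual-fetch script naturally: the wrapper can inject the config directory into every bin/nutch invocation. A rough sketch of one generate/fetch/updatedb block, assuming a hypothetical install path and leaving the segment name as a placeholder (a real script would capture it from the generate step):

```python
import subprocess

NUTCH_HOME = "/opt/nutch"           # hypothetical install location
CONF_DIR = "/opt/nutch/conf.prod"   # passed via ToolBase's -conf option

def nutch(tool, *args):
    """Build a bin/nutch command line, injecting the -conf directory
    that ToolBase-derived tools accept."""
    return [NUTCH_HOME + "/bin/nutch", tool, "-conf", CONF_DIR] + list(args)

def fetch_cycle(crawldb, segments_dir, top_n):
    """One block of the continual fetch process: generate, fetch,
    updatedb. Loop over this until a stop command is given."""
    steps = [
        nutch("generate", crawldb, segments_dir, "-topN", str(top_n)),
        # placeholder: the new segment name comes from the generate step
        nutch("fetch", segments_dir + "/<new-segment>"),
        nutch("updatedb", crawldb, segments_dir + "/<new-segment>"),
    ]
    for cmd in steps:
        if subprocess.call(cmd) != 0:
            raise RuntimeError("step failed: %s" % " ".join(cmd))
```

Heap space and log directories would be handled the same way, by setting the environment (e.g. NUTCH_LOG_DIR) before each call rather than editing config files.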
And previously deployed files become incompatible. From my experience, automating such tasks is not straightforward.

> Finally, we would like to automate search index deployment and
> administration. Unless otherwise stated, the scripts would be able to
> deploy to different target locations configured per machine and allow
> targeting an individual machine, a list of machines, or all machines
> in the cluster. This functionality would include:
>
> 1) Configure a cluster of search servers.
> 2) Deploy, remove, and redeploy index pieces (parts of an index) to
> search servers.
> 3) Start, stop, and restart search servers.

Well, this can be handy. I have written a script which uses start-stop-daemon to start and stop the index servers as background processes. The script also checks the status of the servers. But what is really needed in Nutch is dynamically adding and removing a bunch of index servers without affecting the front-end.

> We would have detailed help screens and fully documented scripts
> using a common framework of scripts. If it was designed correctly, we
> could set up job streams that did automatic crawls, re-crawls,
> integrations, indexing, and deployments to search servers, all of
> which would be needed for the ongoing operation of a web search
> engine.
>
> There is a catch, and that is that this functionality would require
> Python to be installed on at least the controller node.
> This would be a push to the machines, and it would be implemented in
> Python using pexpect, probably issuing commands through ssh, etc.

I think the Python dependency won't be a problem if the management scripts are good enough. A web interface along with the Python interface would be great, I think.

> If you have thoughts on this please let us know and we will see if we
> can integrate the requests into the development.
>
> Dennis Kubes

Good luck with that, I will be looking forward to seeing it.
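The index-piece deployment in points 1 and 2 above could be sketched like this. The host names, port, and the host:port server-list convention for the front-end are assumptions about how such a cluster might be described, not the proposal's actual design:

```python
# Hypothetical search server cluster configuration.
SEARCH_NODES = ["search1.example.com", "search2.example.com"]
SEARCH_PORT = 9999

def assign_pieces(pieces):
    """Round-robin index pieces (parts of an index) across the
    configured search servers (point 2)."""
    assignment = {}
    for i, piece in enumerate(pieces):
        host = SEARCH_NODES[i % len(SEARCH_NODES)]
        assignment.setdefault(host, []).append(piece)
    return assignment

def server_list():
    """One host:port line per search server, for the file the search
    front-end reads to find its distributed searchers."""
    return ["%s:%d" % (host, SEARCH_PORT) for host in SEARCH_NODES]
```

Rewriting the server list and having the front-end re-read it is one place the "add and remove index servers without affecting the front-end" request above could hook in.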
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers