Hi Devs,

I did an initial implementation of the Hadoop provider based on the OGCE implementation, and I also tried Apache Whirr to deploy Hadoop on EC2 (using a recipe along the lines of the one below). But there are certain issues to solve and decisions to take before finishing the first phase of the implementation.
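For reference, the EC2 deployment was driven by a Whirr recipe roughly like this (the property names are the standard ones from Whirr's Hadoop recipe; the cluster name, instance counts, hardware id and key paths are placeholders, not the exact values I used):

    whirr.cluster-name=gfac-hadoop-test
    whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
    whirr.provider=aws-ec2
    whirr.identity=${env:AWS_ACCESS_KEY_ID}
    whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
    whirr.hardware-id=m1.large
    whirr.location-id=us-east-1
    whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
    whirr.public-key-file=${whirr.private-key-file}.pub

Running 'bin/whirr launch-cluster --config hadoop-ec2.properties' then brings the cluster up and writes the client-side Hadoop configuration under ~/.whirr/<cluster-name>/, which should be handy for the Job API option in point 2 below.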
1. There are multiple Hadoop versions out there, including 0.20.x, 0.22.x, 0.23.x, 1.0.x and 2.0.0-alpha. 0.23.x (now renamed to 2.0.0) is a complete overhaul of the previous MapReduce implementation, and a stable release will most probably be available by the end of this year; 1.0.x (the continuation of the 0.20.x line) is, I'm guessing, the most widely used version. We need to decide which version we are going to support.

2. There are two ways of submitting jobs to Hadoop. The first is to use the 'jar' command of the 'hadoop' command-line tool, which assumes you have Hadoop configured (for a local or remote cluster) on your system. The other is to use the Hadoop Job API. That way we can do the job submission programmatically, but we need to specify the Hadoop configuration files and other related things like the jar file and the Mapper and Reducer class names (a rough sketch of this route is in the P.S. below).

3. We can use Whirr to set up Hadoop on a local cluster as well. But the local cluster configuration doesn't support certificate-based authentication; we need to specify username/password pairs and the root passwords of the machines in the local cluster.

I think this is enough for the initial implementation. WDYT?

In addition to the above, I have several questions specific to the Hadoop provider configuration. We may need to add additional configuration parameters to the current GFac configuration, but these will change based on the decisions we make above, so I'll send a separate mail on that later.

Please feel free to comment on the above, and let me know if you need more details.

Thanks
Milinda

--
Milinda Pathirage
PhD Student
Indiana University, Bloomington
E-mail: [email protected]
Web: http://mpathirage.com
Blog: http://blog.mpathirage.com
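P.S. Here's a minimal sketch of what the Job API route (option two above) could look like against a 1.0.x cluster. The host names, HDFS paths and WordCount class names are placeholders I made up for illustration; in the provider, the Mapper/Reducer names and the jar path would arrive as strings from the GFac request (and the jar would have to be on the provider's classpath, or loaded through a URLClassLoader, for Class.forName() to work):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class HadoopJobSubmitter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the client at the cluster instead of relying on a local
            // $HADOOP_HOME; these are the 1.0.x property names.
            conf.set("fs.default.name", "hdfs://namenode.example.org:9000");
            conf.set("mapred.job.tracker", "jobtracker.example.org:9001");
            // Ship the user's jar with the job so the task JVMs can find the
            // Mapper/Reducer classes.
            conf.set("mapred.jar", "/tmp/wordcount-job.jar");

            Job job = new Job(conf, "gfac-hadoop-job");
            // Class names come in as strings, so load them reflectively.
            job.setMapperClass(Class.forName("org.example.WordCountMapper")
                    .asSubclass(Mapper.class));
            job.setReducerClass(Class.forName("org.example.WordCountReducer")
                    .asSubclass(Reducer.class));
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("/user/milinda/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/milinda/output"));

            // Blocks until the job finishes; job.submit() is the async variant.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The nice part is that the same code should work whether the configuration points at a Whirr-provisioned EC2 cluster or a local one; only the configuration we feed it changes.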
