Hi Milinda,

On Jun 3, 2012, at 10:28 AM, Milinda Pathirage wrote:
> Hi Devs,
>
> I did the initial implementation of the Hadoop provider based on the OGCE
> implementation and also tried Apache Whirr to deploy Hadoop on EC2. But
> there are certain issues to solve and decisions to take before finishing
> the first phase of implementation.

Good to hear about your progress; we will appreciate your patches. Comments below.

> 1. There are multiple Hadoop versions out there including 0.20.x, 0.22.x,
> 0.23.x, 1.0.x and 2.0.0-alpha. 0.23.x (now renamed to 2.0.0) is a complete
> overhaul of the previous MapReduce implementation, and a stable release will
> most probably be available at the end of this year, while 0.22.x (1.0.x) is
> (I'm guessing) the most widely used version. We need to decide which version
> we are going to support.

Sorry, but I am not familiar with the Hadoop versions; I tried to understand [1] but could not figure out much. The only other criterion I would have looked at is XBaya's current integration with Amazon Elastic MapReduce, but EMR seems to support most of the versions you mentioned [2]. You seem to have a better understanding of the Hadoop versions, so unless others on the list have a recommendation, I will defer the decision to your judgement.

> 2. There are two ways of submitting jobs to Hadoop. The first is to use the
> jar command of the 'hadoop' command line tool, which assumes that you have
> Hadoop configured (local or remote cluster) on your system. The other method
> is to use the Hadoop Job API. This way we can do the job submission
> programmatically, but we need to specify the Hadoop configuration files and
> other related things like the jar file and the Mapper and Reducer class names.

The API option seems better. Maintaining a properties file might be an issue, but that seems easier to support than assuming Hadoop clients are properly installed on local systems. (A rough sketch of what Job API submission could look like is appended at the end of this mail.)

> 3. We can use Whirr to set up Hadoop on a local cluster as well. But the
> local cluster configuration doesn't support certificate-based
> authentication; we need to specify user name/password pairs and the root
> passwords of the machines in the local cluster. I think this is enough for
> the initial implementation. WDYT?

I agree; the initial implementation can probably live with this limitation, which is outweighed by the advantages of supporting a local cluster.

> In addition to the above, I have several questions specific to the Hadoop
> provider configuration. We may need to add additional configuration
> parameters to the current GFac configuration. But these will change based on
> the decisions above, so I'll send a separate mail on that later.
>
> Please feel free to comment on the above and let me know if you need more
> details.

Please feel free to suggest any changes to the GFac configurations and schemas to incorporate the Hadoop extensions. And as always, patches are welcome.

Cheers,
Suresh

[1] - http://www.cloudera.com/blog/2012/04/apache-hadoop-versions-looking-ahead-3/
[2] - http://aws.amazon.com/elasticmapreduce/faqs/#dev-12

> Thanks
> Milinda
>
> --
> Milinda Pathirage
> PhD Student Indiana University, Bloomington;
> E-mail: [email protected]
> Web: http://mpathirage.com
> Blog: http://blog.mpathirage.com
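P.S. Regarding option 2: here is a minimal sketch of what programmatic submission through the Hadoop Job API could look like. The class name, the configuration file paths and the use of Hadoop's bundled word-count Mapper/Reducer are placeholders only, to keep the example self-contained; the real provider would take the job jar, the class names and the cluster configuration from the GFac application description.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class HadoopJobSubmitter {

    public static void main(String[] args) throws Exception {
        // Point the client at the target cluster by loading its site
        // configuration; these paths are placeholders for whatever the
        // GFac provider configuration would supply.
        Configuration conf = new Configuration();
        conf.addResource(new Path("/path/to/core-site.xml"));
        conf.addResource(new Path("/path/to/mapred-site.xml"));

        Job job = new Job(conf, "word-count");
        job.setJarByClass(HadoopJobSubmitter.class);

        // In the provider, the Mapper/Reducer class names and the job jar
        // would come from the application description; Hadoop's built-in
        // word-count classes are used here only to keep the sketch
        // self-contained.
        job.setMapperClass(TokenCounterMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit and block until the job finishes; 'true' streams the
        // job progress back to the client.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This is roughly equivalent to running 'hadoop jar' by hand, but it lets the provider drive the submission from Java against the configuration files it is given, instead of assuming a properly installed Hadoop client on the local machine.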
