RE: Questions about Hadoop

Arijit Mukherjee Wed, 24 Sep 2008 05:46:23 -0700

That's a very good overview, Paco - thanks for that. I might get back to
you with more queries about Cascading etc. at some point - hope you
won't mind.

Regards
Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com

-----Original Message-----
From: Paco NATHAN [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 24, 2008 6:10 PM
To: [email protected]; [EMAIL PROTECTED]
Subject: Re: Questions about Hadoop

Arijit,

For workflow, check out http://cascading.org  -- that works quite well
and fits what you described.

Greenplum and Aster Data have announced support for running MR within
the context of their relational databases, e.g.,
http://www.greenplum.com/resources/mapreduce/

In terms of Pig, Hive, these RDBMS vendors, etc., they seem to be quite
good for situations where there are lots of ad hoc queries, short-term
business intelligence needs, and less-technical staff involved. However,
if there are large, repeated batch jobs which require significant
analytics work, then I'm not so convinced that SQL is the right mind-set
for representing the math required for algorithms or for maintaining
complex code throughout the software lifecycle.

I run an analytics group where our statisticians use R, while our
developers use Hadoop, Cascading, etc., at scale on terabytes.  One
approach is simply to sample data, analyze it in R, then use the
analysis to articulate requirements for developers to use at scale.
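
As a rough illustration of the sampling step, here is a minimal sketch
of reservoir sampling in Python. It is an assumption about how one
might draw a fixed-size uniform sample from a data set too large to
load at once, not a description of what our group actually runs. The
function name and its parameters are made up for the example.

```python
import random


def reservoir_sample(records, k, seed=None):
    """Return up to k records chosen uniformly at random from an iterable.

    Makes a single pass and holds only k records in memory, so the input
    can be far larger than RAM (hypothetical helper, for illustration).
    """
    rng = random.Random(seed)
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = record
    return sample
```

The resulting sample can be written out as a CSV file and loaded into R
with read.csv for the exploratory analysis.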

In terms of running R on large data, one issue is that -- in contrast to
SAS, where data is handled line-by-line -- R is limited by how much data
can be loaded into memory.

Another issue is that while some areas of statistical data analysis are
suitable for MapReduce, others clearly are not. Mahout or similar
projects may go far, but do not expect them to be capable of displacing
R, SAS, etc.  For example, you can accomplish much by scanning a data
set to determine N, sum X, sum X^2, etc., to produce descriptive stats,
quantiles, C.I., plots for p.d.f., c.d.f., etc. Quite useful. However,
MapReduce requires data independence, so it will not serve well for
tasks such as inverting a matrix.
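
To make the N, sum X, sum X^2 idea concrete, here is a minimal sketch
in plain Python (not Hadoop code). Each partition is reduced
independently to a triple, and the triples are added together, which is
why this kind of work fits MapReduce. The function names are
hypothetical, and the confidence interval assumes a normal
approximation with z = 1.96. Quantiles need more than these three sums,
so they are left out.

```python
import math
from functools import reduce


def map_partition(values):
    """Map step: summarize one partition as (N, sum X, sum X^2)."""
    n, sx, sxx = 0, 0.0, 0.0
    for x in values:
        n += 1
        sx += x
        sxx += x * x
    return (n, sx, sxx)


def combine(a, b):
    """Reduce step: partial summaries add component-wise."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def describe(partitions):
    """Compute mean, sample variance and an approximate 95% C.I."""
    n, sx, sxx = reduce(combine, map(map_partition, partitions), (0, 0.0, 0.0))
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sx / n
    # Sample variance from the sums; clamp tiny negative rounding errors.
    variance = max((sxx - n * mean * mean) / (n - 1), 0.0)
    half_width = 1.96 * math.sqrt(variance / n)
    return {
        "n": n,
        "mean": mean,
        "variance": variance,
        "ci95": (mean - half_width, mean + half_width),
    }
```

For example, describe([[1, 2, 3], [4, 5], [6]]) gives n = 6, mean = 3.5
and variance = 3.5. The sum-of-squares formula can lose precision when
the mean is large relative to the spread, so treat this as a sketch
rather than production code.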

You might want to look into Parallel R, and talk with
http://www.revolution-computing.com/

Our team has a project which runs Hadoop workflows underneath R.  It is
at an early stage, and there's no plan yet about a public release. It's
not a simple thing to implement by any stretch of the imagination!

Best,
Paco

On Wed, Sep 24, 2008 at 4:39 AM, Arijit Mukherjee
<[EMAIL PROTECTED]> wrote:
> Thanx Enis.
>
> By workflow, I was trying to mean something like a chain of MapReduce 
> jobs - the first one will extract a certain amount of data from the 
> original set and do some computation resulting in a smaller summary, 
> which will then be the input to a further MR job, and so on...somewhat
> similar to a workflow as in the SOA world.
>
> Is it possible to use statistical analysis tools such as R (or say 
> PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is 
> working on a custom MapReduce engine over their Greenplum database 
> which will also support PL/R procedures.
>
> Arijit
>
> Dr. Arijit Mukherjee
> Principal Member of Technical Staff, Level-II
> Connectiva Systems (I) Pvt. Ltd.
> J-2, Block GP, Sector V, Salt Lake
> Kolkata 700 091, India
> Phone: +91 (0)33 23577531/32 x 107
> http://www.connectivasystems.com
>
