Re: Questions about Hadoop

Paco NATHAN Wed, 24 Sep 2008 06:20:04 -0700

Certainly. It'd be great to talk with others working in analytics and
statistical computing, who have been evaluating MapReduce as well.


Paco


On Wed, Sep 24, 2008 at 7:45 AM, Arijit Mukherjee
<[EMAIL PROTECTED]> wrote:
> That's a very good overview, Paco - thanks for that. I might get back to
> you with more queries about Cascading etc. at some time - hope you
> wouldn't mind.
>
> Regards
> Arijit
>
> Dr. Arijit Mukherjee
> Principal Member of Technical Staff, Level-II
> Connectiva Systems (I) Pvt. Ltd.
> J-2, Block GP, Sector V, Salt Lake
> Kolkata 700 091, India
> Phone: +91 (0)33 23577531/32 x 107
> http://www.connectivasystems.com
>
>
> -----Original Message-----
> From: Paco NATHAN [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 24, 2008 6:10 PM
> To: [email protected]; [EMAIL PROTECTED]
> Subject: Re: Questions about Hadoop
>
>
> Arijit,
>
> For workflow, check out http://cascading.org  -- that works quite well
> and fits what you described.
>
> Greenplum and Aster Data have announced support for running MR within
> the context of their relational databases, e.g.,
> http://www.greenplum.com/resources/mapreduce/
>
> In terms of Pig, Hive, these RDBMS vendors, etc., they seem to be quite
> good for situations where there are lots of ad hoc queries, short-term
> business intelligence needs, and less-technical staff involved. However,
> if there are large, repeated batch jobs which require significant
> analytics work, then I'm not so convinced that SQL is the right mind-set
> for representing the math required for algorithms or for maintaining
> complex code throughout the software lifecycle.
>
>
> I run an analytics group where our statisticians use R, while our
> developers use Hadoop, Cascading, etc., at scale on terabytes.  One
> approach is simply to sample data, analyze it in R, then use the
> analysis to articulate requirements for developers to use at scale.
>
> In terms of running R on large data, one issue is that -- in contrast to
> SAS, where data is handled line-by-line -- R is limited by how much data
> can be loaded into memory.
>
> Another issue is that while some areas of statistical data analysis are
> suitable for MapReduce, others clearly are not. Mahout or similar
> projects may go far, but do not expect them to be capable of displacing
> R, SAS, etc.  For example, you can accomplish much by scanning a data
> set to determine N, sum X, sum X^2, etc., to produce descriptive stats,
> quantiles, C.I., plots for p.d.f., c.d.f., etc. Quite useful. However,
> MapReduce requires data independence, so it will not serve well for
> tasks such as inverting a matrix.
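[Illustrative sketch, not part of the original message.] The single-scan
idea above can be shown in plain Python: each partition is reduced to the
triple (N, sum X, sum of squares), and the triples are then added together
to get the mean and sample standard deviation. The function names and the
sample data are made up for illustration, and nothing here is Hadoop or
Cascading API. Note that this textbook variance formula can lose precision
on data with a large mean and a small spread.

```python
from functools import reduce
from math import sqrt


def map_partition(values):
    """Map step: summarize one partition as (N, sum X, sum X^2)."""
    n = 0
    sx = 0.0
    sxx = 0.0
    for x in values:
        n += 1
        sx += x
        sxx += x * x
    return (n, sx, sxx)


def combine(a, b):
    """Reduce step: partial sums add component-wise, in any order."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def describe(partitions):
    """Return (N, mean, sample standard deviation) over all partitions."""
    n, sx, sxx = reduce(combine, map(map_partition, partitions), (0, 0.0, 0.0))
    mean = sx / n
    variance = (sxx - n * mean * mean) / (n - 1)  # sample variance
    return n, mean, sqrt(variance)


# Hypothetical data, split into three partitions as a cluster might do.
partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
n, mean, stdev = describe(partitions)
print(n, mean, stdev)
```

Because the partial sums are associative and commutative, the partitions
can be processed independently and combined in any order, which is the
data-independence property mentioned above.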
>
> You might want to look into Parallel R, and talk with
> http://www.revolution-computing.com/
>
> Our team has a project which runs Hadoop workflows underneath R.  It is
> at an early stage, and there's no plan yet about a public release. It's
> not a simple thing to implement by any stretch of the imagination!
>
> Best,
> Paco
>
>
>
> On Wed, Sep 24, 2008 at 4:39 AM, Arijit Mukherjee
> <[EMAIL PROTECTED]> wrote:
>> Thanks, Enis.
>>
>> By workflow, I meant something like a chain of MapReduce
>> jobs - the first one will extract a certain amount of data from the
>> original set and do some computation resulting in a smaller summary,
>> which will then be the input to a further MR job, and so on... somewhat
>> similar to a workflow as in the SOA world.
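[Illustrative sketch, not part of the original message.] The kind of chain
described above, where one job's smaller summary feeds the next job, might
look roughly like this in plain Python. The run_job helper, the mappers
and reducers, and the sample records are all hypothetical, and this is an
in-memory toy rather than Hadoop or Cascading code.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one in-memory MapReduce pass: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: extract (user, bytes) from raw records and total bytes per user.
def extract_mapper(record):
    user, nbytes = record
    yield user, nbytes


def sum_reducer(key, values):
    return (key, sum(values))


# Job 2: bucket the per-user totals from job 1 and count users per bucket.
def bucket_mapper(record):
    _user, total = record
    yield ("heavy" if total >= 100 else "light"), 1


def count_reducer(key, values):
    return (key, sum(values))


# Hypothetical raw input; the output of job 1 is the input of job 2.
raw = [("ann", 60), ("bob", 10), ("ann", 70), ("cat", 20), ("bob", 5)]
per_user = run_job(raw, extract_mapper, sum_reducer)
summary = run_job(per_user, bucket_mapper, count_reducer)
print(per_user)
print(summary)
```

Each stage only sees the output of the previous one, so stages can be
added or swapped without touching the rest of the chain.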
>>
>> Is it possible to use statistical analysis tools such as R (or say
>> PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is
>> working on a custom MapReduce engine over their Greenplum database
>> which will also support PL/R procedures.
>>
>> Arijit
>>
>> Dr. Arijit Mukherjee
>> Principal Member of Technical Staff, Level-II
>> Connectiva Systems (I) Pvt. Ltd.
>> J-2, Block GP, Sector V, Salt Lake
>> Kolkata 700 091, India
>> Phone: +91 (0)33 23577531/32 x 107
>> http://www.connectivasystems.com
>>
