Certainly. It'd be great to talk with others working in analytics and statistical computing, who have been evaluating MapReduce as well.
Paco

On Wed, Sep 24, 2008 at 7:45 AM, Arijit Mukherjee <[EMAIL PROTECTED]> wrote:
> That's a very good overview Paco - thanx for that. I might get back to
> you with more queries about cascade etc. at some time - hope you
> wouldn't mind.
>
> Regards
> Arijit
>
> Dr. Arijit Mukherjee
> Principal Member of Technical Staff, Level-II
> Connectiva Systems (I) Pvt. Ltd.
> J-2, Block GP, Sector V, Salt Lake
> Kolkata 700 091, India
> Phone: +91 (0)33 23577531/32 x 107
> http://www.connectivasystems.com
>
>
> -----Original Message-----
> From: Paco NATHAN [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 24, 2008 6:10 PM
> To: [email protected]; [EMAIL PROTECTED]
> Subject: Re: Questions about Hadoop
>
>
> Arijit,
>
> For workflow, check out http://cascading.org -- that works quite well
> and fits what you described.
>
> Greenplum and Aster Data have announced support for running MR within
> the context of their relational databases, e.g.,
> http://www.greenplum.com/resources/mapreduce/
>
> In terms of PIG, Hive, these RDBMS vendors, etc., they seem to be quite
> good for situations where there are lots of ad hoc queries, business
> intelligence needs short-term, less-technical staff involved. However,
> if there are large, repeated batch jobs which require significant
> analytics work, then I'm not so convinced that SQL is the right mind-set
> for representing the math required for algorithms or for maintaining
> complex code throughout the software lifecycle.
>
>
> I run an analytics group where our statisticians use R, while our
> developers use Hadoop, Cascading, etc., at scale on terabytes. One
> approach is simply to sample data, analyze it in R, then use the
> analysis to articulate requirements for developers to use at scale.
>
> In terms of running R on large data, one issue is that -- in contrast to
> SAS, where data is handled line-by-line -- R is limited by how much data
> can be loaded into memory.
>
> Another issue is that while some areas of statistical data analysis are
> suitable for MapReduce, others clearly are not. Mahout or similar
> projects may go far, but do not expect them to be capable of displacing
> R, SAS, etc. For example, you can accomplish much by scanning a data
> set to determine N, sum X, sum X^2, etc., to produce descriptive stats,
> quantiles, C.I., plots for p.d.f., c.d.f., etc. Quite useful. However,
> MapReduce requires data independence, so it will not serve well for
> tasks such as inverting a matrix.
>
> You might want to look into Parallel R, and talk with
> http://www.revolution-computing.com/
>
> Our team has a project which runs Hadoop workflows underneath R. It is
> at an early stage, and there's no plan yet about a public release. It's
> not a simple thing to implement by any stretch of the imagination!
>
> Best,
> Paco
>
>
>
> On Wed, Sep 24, 2008 at 4:39 AM, Arijit Mukherjee
> <[EMAIL PROTECTED]> wrote:
>> Thanx Enis.
>>
>> By workflow, I was trying to mean something like a chain of MapReduce
>> jobs - the first one will extract a certain amount of data from the
>> original set and do some computation resulting in a smaller summary,
>> which will then be the input to a further MR job, and so on...somewhat
>> similar to a workflow as in the SOA world.
>>
>> Is it possible to use statistical analysis tools such as R (or say
>> PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is
>> working on a custom MapReduce engine over their Greenplum database
>> which will also support PL/R procedures.
>>
>> Arijit
>>
>> Dr. Arijit Mukherjee
>> Principal Member of Technical Staff, Level-II
>> Connectiva Systems (I) Pvt. Ltd.
>> J-2, Block GP, Sector V, Salt Lake
>> Kolkata 700 091, India
>> Phone: +91 (0)33 23577531/32 x 107
>> http://www.connectivasystems.com
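As a rough illustration of the point in the quoted message about scanning a data set for N, sum X, and the sum of squares: the sketch below is plain Python, not Hadoop code, and it only mimics the map and reduce steps in memory. The function names (map_partition, combine, describe) and the sample partitions are hypothetical, chosen for illustration. It covers the mean, sample variance, and a normal-approximation 95% interval only; the sum-of-squares variance formula can also lose precision on large or poorly scaled data.

```python
from functools import reduce
from math import sqrt


def map_partition(values):
    """Map step: summarize one partition as (N, sum X, sum X^2)."""
    n, sum_x, sum_x2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        sum_x += x
        sum_x2 += x * x
    return (n, sum_x, sum_x2)


def combine(a, b):
    """Reduce step: partial summaries add component-wise, in any order."""
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])


def describe(stats):
    """Derive descriptive statistics from the combined summary."""
    n, sum_x, sum_x2 = stats
    mean = sum_x / n
    # Sample variance from the sufficient statistics (needs n > 1).
    variance = (sum_x2 - n * mean * mean) / (n - 1)
    stderr = sqrt(variance / n)
    # 1.96 is the normal-approximation multiplier for a 95% interval.
    return {
        "n": n,
        "mean": mean,
        "variance": variance,
        "ci95": (mean - 1.96 * stderr, mean + 1.96 * stderr),
    }


# Hypothetical data, split the way input splits might be.
partitions = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
totals = reduce(combine, map(map_partition, partitions))
summary = describe(totals)
print(summary)
```

Because combine is associative and commutative, each partition can be summarized independently and the partial results merged in any order, which is the data independence the quoted message refers to.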
