[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877270#action_12877270
 ] 

Jake Mannix commented on MAPREDUCE-1849:
----------------------------------------

+1 from this casual observer over from Mahout-land (nobody ever seems to 
believe me that this would make Hadoop programmers soooooo much more efficient).

I've written a half-baked, bug-ridden, inefficient version of this several 
times in the past, and it would be *so* useful to have done right.

An API that essentially wrapped a SequenceFile<K,V> and let you do 
things like

  Path dataPath = new Path("hdfs://foo/bar");
  PTable<K,V> data = new PTable<K,V>(dataPath);
  LightWeightMap<K,V,KOUT,VOUT> map = new MyMapper();
  PTable<KOUT,VOUT> transformedData = data.parallelDo(map);

etc. would be awesome.
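
To make the shape of that API a bit more concrete, here is a rough in-memory sketch. It only reuses the hypothetical names from the snippet above (PTable, LightWeightMap, parallelDo); the emit-callback signature, the add() and records() helpers, and the eager in-memory list are all assumptions made for illustration. Nothing here touches HDFS, SequenceFile, or MapReduce.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical interface: one input record in, zero or more output records out.
interface LightWeightMap<K, V, KOUT, VOUT> {
    void map(K key, V value, BiConsumer<KOUT, VOUT> emit);
}

// Hypothetical in-memory stand-in for a table that would wrap a SequenceFile<K,V>.
class PTable<K, V> {
    private final List<Map.Entry<K, V>> records = new ArrayList<>();

    // Assumed helper for building test data; returns this for chaining.
    PTable<K, V> add(K key, V value) {
        records.add(new SimpleEntry<>(key, value));
        return this;
    }

    List<Map.Entry<K, V>> records() {
        return records;
    }

    // Applies fn to every record and collects whatever it emits into a new table.
    // A real implementation would defer this and run it as a map task.
    <KOUT, VOUT> PTable<KOUT, VOUT> parallelDo(LightWeightMap<K, V, KOUT, VOUT> fn) {
        PTable<KOUT, VOUT> out = new PTable<>();
        for (Map.Entry<K, V> r : records) {
            fn.map(r.getKey(), r.getValue(), out::add);
        }
        return out;
    }
}
```

With this, something like data.parallelDo((k, v, emit) -> emit.accept(v, k)) swaps keys and values, which is roughly the level of ceremony one would hope for.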

Of course, the real trick is writing a good optimizer which can figure out how 
to squish together separate M/R steps into one (for example, parallelDo() 
returns a PCollection, on which you might then call groupByKey(), but the two 
can often easily be combined into the Map and Reduce steps of a single job).
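
A toy illustration of the kind of squishing I mean, as a sketch only: the Op, Job and NaivePlanner names are made up, and the greedy rule (keep adding operations to a job until it would need a second shuffle) is an assumption for illustration, not how any real optimizer is known to work. A real one would operate on a graph of deferred operations rather than a flat list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical logical operations recorded by a deferred pipeline.
enum Op { PARALLEL_DO, GROUP_BY_KEY }

// One physical MapReduce job: ops run in the mapper, an optional shuffle,
// then ops run in the reducer.
class Job {
    final List<Op> mapSide = new ArrayList<>();
    final List<Op> reduceSide = new ArrayList<>();
    boolean hasShuffle = false;
}

class NaivePlanner {
    // Greedy packing: each job absorbs operations until it would need a second shuffle.
    static List<Job> plan(List<Op> pipeline) {
        List<Job> jobs = new ArrayList<>();
        Job current = null;
        for (Op op : pipeline) {
            // Start a new job at the beginning, or when this job already has its shuffle.
            if (current == null || (op == Op.GROUP_BY_KEY && current.hasShuffle)) {
                current = new Job();
                jobs.add(current);
            }
            if (op == Op.GROUP_BY_KEY) {
                current.hasShuffle = true;      // becomes the job's shuffle step
            } else if (current.hasShuffle) {
                current.reduceSide.add(op);     // fused into the reducer
            } else {
                current.mapSide.add(op);        // fused into the mapper
            }
        }
        return jobs;
    }
}
```

Under this rule a parallelDo() followed by a groupByKey() plans as one job instead of two, and any run of consecutive parallelDo() calls collapses into a single map stage.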

> Implement a FlumeJava-like library for operations over parallel collections 
> using Hadoop MapReduce
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1849
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1849
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Jeff Hammerbacher
>
> The API used internally at Google is described in great detail at 
> http://portal.acm.org/citation.cfm?id=1806596.1806638.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
