[ https://issues.apache.org/jira/browse/MAPREDUCE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877270#action_12877270 ]

Jake Mannix commented on MAPREDUCE-1849:
----------------------------------------

+1 from this casual observer over from Mahout-land (nobody ever seems to believe me that this would make Hadoop programmers soooooo much more efficient). I've written a half-baked, bug-ridden, inefficient version of this several times in the past, and it would be *so* useful to have done right.

An api which essentially wrapped a SequenceFile<K,V> and allowed you to do things like

    Path dataPath = new Path("hdfs://foo/bar");
    PTable<K,V> data = new PTable<K,V>(dataPath);
    LightWeightMap<K,V,KOUT,VOUT> map = new MyMapper();
    PTable<KOUT,VOUT> transformedData = data.parallelDo(map);

etc. would be awesome.

Of course, the real trick is writing a good optimizer which can figure out how to squish together separate M/R steps into one (for example, parallelDo() returns a PCollection, which you might then do groupByKey() on, but these could often easily be combined into the Map and Reduce steps of a single job).

> Implement a FlumeJava-like library for operations over parallel collections
> using Hadoop MapReduce
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1849
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1849
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Jeff Hammerbacher
>
> The API used internally at Google is described in great detail at
> http://portal.acm.org/citation.cfm?id=1806596.1806638.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
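
For illustration only, here is a rough in-memory sketch of the parallelDo() followed by groupByKey() shape described in the comment above. It borrows the names PTable and LightWeightMap from the comment, but everything here is a hypothetical stand-in: there is no SequenceFile, no HDFS, and no Hadoop job. The put() helper and the emit callback are assumptions added to keep the example self-contained. It runs eagerly, so it shows only the API shape, not the optimizer that would fuse the two steps into one job.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical stand-in for the comment's LightWeightMap: one input pair in,
// zero or more output pairs out through the emit callback.
interface LightWeightMap<K, V, KOUT, VOUT> {
    void map(K key, V value, BiConsumer<KOUT, VOUT> emit);
}

// Hypothetical in-memory PTable. A real one would wrap a SequenceFile<K,V>
// and defer execution so that an optimizer could combine steps.
final class PTable<K, V> {
    private final List<Map.Entry<K, V>> pairs = new ArrayList<>();

    // Assumed helper for building test data; returns this for chaining.
    PTable<K, V> put(K key, V value) {
        pairs.add(new AbstractMap.SimpleImmutableEntry<>(key, value));
        return this;
    }

    // Applies fn to every pair and collects whatever it emits (the "map" side).
    <KOUT, VOUT> PTable<KOUT, VOUT> parallelDo(LightWeightMap<K, V, KOUT, VOUT> fn) {
        PTable<KOUT, VOUT> out = new PTable<>();
        for (Map.Entry<K, V> p : pairs) {
            fn.map(p.getKey(), p.getValue(), out::put);
        }
        return out;
    }

    // Collects all values under their key, in first-seen key order
    // (the "shuffle" side).
    Map<K, List<V>> groupByKey() {
        Map<K, List<V>> groups = new LinkedHashMap<>();
        for (Map.Entry<K, V> p : pairs) {
            groups.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return groups;
    }
}

final class PTableDemo {
    public static void main(String[] args) {
        // Keys are made-up byte offsets, values are lines of text.
        PTable<Long, String> lines = new PTable<Long, String>()
                .put(0L, "to be or")
                .put(9L, "not to be");
        // Emit (word, 1) for every word, then group the ones by word.
        PTable<String, Integer> ones = lines.parallelDo((offset, line, emit) -> {
            for (String word : line.split(" ")) {
                emit.accept(word, 1);
            }
        });
        System.out.println(ones.groupByKey());
    }
}
```

In a deferred design, parallelDo() would only record the operation, and the following groupByKey() would let a planner place the mapper and the grouping into the Map and Reduce phases of a single job, which is the optimization the comment calls the real trick.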