[ https://issues.apache.org/jira/browse/MAPREDUCE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877224#action_12877224 ]

Jeff Hammerbacher commented on MAPREDUCE-1849:
----------------------------------------------

h2. Data Model
* "The central class of the FlumeJava library is *PCollection<T>*, a (possibly huge) immutable bag of elements of type T."
** Can be unordered (collection) or ordered (sequence)
** Can be created from an underlying Java Collection<T> for local execution
** recordsOf() can be used to indicate how to read the elements of the collection (cf. Pig's LoadFunc or Hive's SerDe)
* Second central class: *PTable<K, V>*
** Immutable multi-map with keys of class K and values of class V
** Subclass of PCollection<Pair<K, V>>

h2. Operators
* parallelDo(PCollection<T>): PCollection<S>; runs S doFunc(T) over each element
* groupByKey(PTable<K, V>): PTable<K, Collection<V>>; turns a multi-map into a uni-map
* combineValues(PTable<K, Collection<V>>): PTable<K, V>; does the reduction
* flatten(): logical view of multiple PCollections as one PCollection
* writeToRecordFileTable(): flushes the output of a pipeline to a table

> Implement a FlumeJava-like library for operations over parallel collections using Hadoop MapReduce
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1849
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1849
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Jeff Hammerbacher
>
> The API used internally at Google is described in great detail at http://portal.acm.org/citation.cfm?id=1806596.1806638.
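
For illustration only, here is a minimal local sketch of how the operators listed in the comment could compose into a word count. It is not the FlumeJava API and not a proposed Hadoop implementation: the class name LocalOps, the method signatures, and the use of in-memory java.util lists and maps in place of PCollection and PTable are all assumptions made for this example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical in-memory stand-in for the operators; a List plays the role of
// PCollection<T> and a List of entries plays the role of PTable<K, V>.
final class LocalOps {
    private LocalOps() {}

    // parallelDo: apply fn to every element (sequentially here, for simplicity).
    static <T, S> List<S> parallelDo(List<T> in, Function<T, S> fn) {
        List<S> out = new ArrayList<>();
        for (T t : in) {
            out.add(fn.apply(t));
        }
        return out;
    }

    // groupByKey: turn a multi-map (key/value pairs) into a uni-map of key -> values.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> table) {
        Map<K, List<V>> out = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : table) {
            out.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return out;
    }

    // combineValues: reduce each key's values with an associative combiner.
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> grouped, BinaryOperator<V> combiner) {
        Map<K, V> out = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> e : grouped.entrySet()) {
            e.getValue().stream().reduce(combiner).ifPresent(v -> out.put(e.getKey(), v));
        }
        return out;
    }

    // flatten: present several collections as one (copied here rather than a lazy view).
    @SafeVarargs
    static <T> List<T> flatten(List<T>... parts) {
        List<T> out = new ArrayList<>();
        for (List<T> p : parts) {
            out.addAll(p);
        }
        return out;
    }

    // Word count over two inputs: flatten -> parallelDo -> groupByKey -> combineValues.
    static Map<String, Integer> wordCount(List<String> a, List<String> b) {
        List<String> words = flatten(a, b);
        List<Map.Entry<String, Integer>> pairs =
            parallelDo(words, w -> new AbstractMap.SimpleImmutableEntry<String, Integer>(w, 1));
        return combineValues(groupByKey(pairs), Integer::sum);
    }
}
```

A real implementation on Hadoop MapReduce would presumably defer execution and plan these calls into map and reduce stages; the sketch above evaluates eagerly and only shows how the operator shapes fit together.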