[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12877224#action_12877224
 ] 

Jeff Hammerbacher commented on MAPREDUCE-1849:
----------------------------------------------

h2. Data Model
* "The central class of the FlumeJava library is *PCollection<T>*, a (possibly 
huge) immutable bag of elements of type T."
** Can be unordered (collection) or ordered (sequence)
** Could be created with an underlying Java Collection<T> for local execution
** recordsOf() can be used to indicate how to read the elements of the 
collection (cf. Pig's LoadFunc or Hive's SerDe)
* Second central class: *PTable<K, V>*
** Immutable multi-map with keys of class K and values of class V
** Subclass of PCollection<Pair<K, V>> 
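
For the local-execution case mentioned above, a minimal sketch of the two classes might look like the following. The names LocalPCollection, LocalPTable and pair() are hypothetical and are not taken from the FlumeJava paper, and Map.Entry stands in for Pair<K, V>. It only illustrates "immutable bag backed by a Java collection" and "PTable is a PCollection of pairs".

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical local stand-in for PCollection<T>: an immutable bag
// backed by an ordinary Java List. Not the FlumeJava API.
class LocalPCollection<T> {
    private final List<T> elements;

    LocalPCollection(List<T> source) {
        // Defensive copy, so later changes to 'source' cannot leak in,
        // wrapped so that callers cannot modify the bag either.
        this.elements = Collections.unmodifiableList(new ArrayList<>(source));
    }

    List<T> elements() {
        return elements;
    }

    int size() {
        return elements.size();
    }
}

// Hypothetical stand-in for PTable<K, V>: a collection of key/value pairs
// in which the same key may appear more than once (a multi-map).
class LocalPTable<K, V> extends LocalPCollection<Map.Entry<K, V>> {
    LocalPTable(List<Map.Entry<K, V>> pairs) {
        super(pairs);
    }

    // Convenience factory for one immutable key/value pair.
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new SimpleImmutableEntry<>(key, value);
    }
}
```

A distributed implementation would not hold the elements in memory. The point here is only the type relationship between the two classes.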

h2. Operators
* parallelDo(PCollection<T>): PCollection<S>: runs S doFunc(T) over each element
* groupByKey(PTable<K, V>): PTable<K, Collection<V>>: turns a 
multi-map into a uni-map (one entry per key, holding all of that key's values)
* combineValues(PTable<K, Collection<V>>): PTable<K, V>: reduces each key's 
collection of values to a single value with an associative combining function
* flatten(): logical view of multiple PCollections as one PCollection
* writeToRecordFileTable() to flush the output of a pipeline to a table
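
To make the operator semantics concrete, here is a minimal in-memory sketch using plain Java collections. Everything in it (the LocalOps class, its method signatures, the wordCount helper) is an assumption for illustration and is not the FlumeJava API: it runs sequentially, uses List and Map in place of PCollection and PTable, and flatten() copies elements instead of providing a logical view.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;
import java.util.function.Function;

// Hypothetical in-memory analogues of the operators listed above.
final class LocalOps {
    private LocalOps() {}

    // parallelDo: apply fn to every element (sequentially in this sketch).
    static <T, S> List<S> parallelDo(List<T> in, Function<T, S> fn) {
        List<S> out = new ArrayList<>();
        for (T element : in) {
            out.add(fn.apply(element));
        }
        return out;
    }

    // groupByKey: gather all values that share a key, so each key appears once.
    static <K, V> Map<K, List<V>> groupByKey(List<Map.Entry<K, V>> in) {
        Map<K, List<V>> out = new LinkedHashMap<>();
        for (Map.Entry<K, V> e : in) {
            out.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return out;
    }

    // combineValues: reduce each key's values to one value with the combiner.
    // Keys whose value list is empty are skipped.
    static <K, V> Map<K, V> combineValues(Map<K, List<V>> in, BinaryOperator<V> combiner) {
        Map<K, V> out = new LinkedHashMap<>();
        for (Map.Entry<K, List<V>> e : in.entrySet()) {
            e.getValue().stream().reduce(combiner).ifPresent(v -> out.put(e.getKey(), v));
        }
        return out;
    }

    // flatten: treat several collections as one (here by copying them in order).
    @SafeVarargs
    static <T> List<T> flatten(List<T>... parts) {
        List<T> out = new ArrayList<>();
        for (List<T> part : parts) {
            out.addAll(part);
        }
        return out;
    }

    // Word count composed from the three operators above.
    static Map<String, Integer> wordCount(List<String> words) {
        List<Map.Entry<String, Integer>> ones =
                parallelDo(words, w -> new SimpleImmutableEntry<>(w, 1));
        return combineValues(groupByKey(ones), Integer::sum);
    }
}
```

With this sketch, LocalOps.wordCount(Arrays.asList("a", "b", "a")) maps "a" to 2 and "b" to 1: parallelDo emits (word, 1) pairs, groupByKey collects the ones per word, and combineValues sums them.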

> Implement a FlumeJava-like library for operations over parallel collections 
> using Hadoop MapReduce
> --------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1849
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1849
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>            Reporter: Jeff Hammerbacher
>
> The API used internally at Google is described in great detail at 
> http://portal.acm.org/citation.cfm?id=1806596.1806638.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
