[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15336923#comment-15336923
 ] 

Daniel Templeton commented on MAPREDUCE-6712:
---------------------------------------------

Pyspark actually does some cleverness under covers to replace as much of your 
python code with Scala as possible.  Regardless, the advantages of Spark over 
MR in performance are huge.  Not being forced to serialize to HDFS between 
chained jobs and being able to cache data in memory are a big deal.

> Support grouping values for reducer on java-side
> ------------------------------------------------
>
>                 Key: MAPREDUCE-6712
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6712
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: He Tianyi
>            Priority: Minor
>
> In hadoop streaming, with TextInputWriter, reducer program will receive each 
> line representing a (k, v) tuple from {{stdin}}, in which values with 
> identical key is not grouped.
> This brings some inefficiency, especially for runtimes based on interpreter 
> (e.g. cpython), coming from:
> A. user program has to compare key with previous one (but on java side, 
> records already come to reducer in groups),
> B. user program has to perform {{read}}, then {{find}} or {{split}} on each 
> record. even if there are multiple values with identical key,
> C. if length of key is large, apparently this introduces inefficiency for 
> caching,
> Suppose we need another InputWriter. But this is not enough, since the 
> interface of {{InputWriter}} defined {{writeKey}} and {{writeValue}}, not 
> {{writeValues}}. Though we can compare key in custom InputWriter and group 
> them, but this is also inefficient. Some other changes are also needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: mapreduce-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-issues-h...@hadoop.apache.org

Reply via email to