[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

Wang Shouyan (JIRA) Wed, 03 Mar 2010 21:56:53 -0800

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12841070#action_12841070 ]


Wang Shouyan commented on MAPREDUCE-1270:
-----------------------------------------

"In terms of APIs, as I previously mentioned I strongly recommend you start 
using the Hadoop Pipes APIs and enhance them - this will ensure compatibility 
between Hadoop Pipes and HCE - again, please consider moving the 
sort/shuffle/merge to Hadoop Pipes as I recommended previously."

I do not agree with this opinion. If we need to establish a standard C++ API, 
I don't think it has to be completely compatible with the Pipes API, because I 
don't think the Pipes API was carefully designed. It may have been shaped by 
compatibility with some other code, but it has never been discussed adequately.

If we do need a C++ API, we should weigh usability and extensibility more 
heavily than compatibility, because I do not see that such compatibility is 
actually a problem for most users.

Any suggestions on usability and extensibility are welcome.

> Hadoop C++ Extention
> --------------------
>
>                 Key: MAPREDUCE-1270
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>    Affects Versions: 0.20.1
>         Environment:  hadoop linux
>            Reporter: Wang Shouyan
>
>   Hadoop C++ Extension is an internal project at Baidu. We started it for 
> these reasons:
>    1  To provide a C++ API. We mostly used Streaming before, and we also 
> tried PIPES, but we did not find PIPES to be more efficient than Streaming, 
> so we think a new C++ extension is needed.
>    2  Even with PIPES or Streaming, it is hard to control the memory of the 
> Hadoop map/reduce Child JVM.
>    3  Reading, writing and sorting TB/PB of data in Java is very costly. With 
> PIPES or Streaming, a pipe or socket is not an efficient way to carry that 
> much data.
>    What we want to do:
>    1 We do not use the map/reduce Child JVM for any data processing. It just 
> prepares the environment, starts the C++ mapper, tells the mapper which 
> split it should deal with, and reads reports from the mapper until it 
> finishes. The mapper reads records, invokes the user-defined map, 
> partitions, writes spills, combines, and merges into file.out. We think 
> these operations can be done in C++ code.
>    2 The reducer is similar to the mapper. It is started after the sort 
> finishes, reads from the sorted files, invokes the user-defined reduce, and 
> writes to the user-defined record writer.
>    3 We also intend to rewrite shuffle and sort in C++, for efficiency and 
> memory control.
>    We will do 1 and 2 first, then 3.
>    What's the difference from PIPES:
>    1 Yes, we will reuse most of the PIPES code.
>    2 And we will do it more completely: nothing changes in scheduling and 
> management, but everything changes in execution.
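
The mapper-side steps listed in item 1 of the description (read records, 
invoke the user-defined map, partition, sort, combine) can be sketched in C++ 
roughly as below. This is only an in-memory illustration under stated 
assumptions: every name in it (KeyValue, Emit, wordCountMap, partitionFor, 
runMapTask) is hypothetical and is not taken from HCE or from the Hadoop Pipes 
API, and the spill-to-disk and merge-into-file.out steps are left out.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not the HCE or Hadoop Pipes API.
using KeyValue = std::pair<std::string, long>;
using Emit = std::function<void(const std::string&, long)>;

// Example user-defined map: emit (word, 1) for each whitespace-separated token.
void wordCountMap(const std::string& record, const Emit& emit) {
    std::istringstream in(record);
    std::string word;
    while (in >> word) emit(word, 1);
}

// Partition step: pick a reduce partition from the key's hash.
// Assumes numPartitions > 0.
std::size_t partitionFor(const std::string& key, std::size_t numPartitions) {
    return std::hash<std::string>{}(key) % numPartitions;
}

// Run the map function over the records of one split, then partition, sort
// and combine entirely in memory. std::map keeps keys sorted, and summing on
// insert plays the role of the combiner. A real task would instead spill
// sorted runs to disk and merge them into a single output file.
std::vector<std::map<std::string, long>> runMapTask(
        const std::vector<std::string>& records, std::size_t numPartitions) {
    std::vector<std::map<std::string, long>> partitions(numPartitions);
    Emit emit = [&](const std::string& key, long value) {
        partitions[partitionFor(key, numPartitions)][key] += value;
    };
    for (const std::string& record : records) wordCountMap(record, emit);
    return partitions;
}
```

Calling runMapTask with the records of one split and the number of reducers 
returns one sorted, combined key/value map per partition. Which partition a 
given key lands in depends on the standard library's hash, so only the 
per-key totals are stable across platforms.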

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
