[ https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
luoxu updated MAPREDUCE-1270:
------------------------------
Affects Version/s: (was: 2.6.2)
> Hadoop C++ Extension
> --------------------
>
> Key: MAPREDUCE-1270
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Affects Versions: 0.20.1
> Environment: hadoop linux
> Reporter: Wang Shouyan
> Attachments: HADOOP-HCE-1.0.0.patch, HCE InstallMenu.pdf, HCE
> Performance Report.pdf, HCE Tutorial.pdf, Overall Design of Hadoop C++
> Extension.doc
>
>
> Hadoop C++ Extension (HCE) is an internal project at Baidu. We started it
> for these reasons:
> 1 To provide a C++ API. We have mostly used Streaming, and we also tried
> PIPES, but we did not find PIPES to be more efficient than Streaming. So we
> think a new C++ extension is needed.
> 2 Even with PIPES or Streaming, it is hard to control the memory of the
> Hadoop map/reduce Child JVM.
> 3 Reading, writing, and sorting TB/PB-scale data in Java is costly. With
> PIPES or Streaming, a pipe or socket is not an efficient way to carry that
> much data.
> What we want to do:
> 1 We do not use the map/reduce Child JVM for any data processing. It only
> prepares the environment, starts the C++ mapper, tells the mapper which
> split to process, and reads reports from the mapper until it finishes. The
> mapper reads records, invokes the user-defined map, partitions the output,
> writes spills, combines, and merges into file.out. We think these
> operations can be done in C++ code.
> 2 The reducer is similar to the mapper. It is started after the sort
> finishes, reads from the sorted files, invokes the user-defined reduce, and
> writes to the user-defined record writer.
> 3 We also intend to rewrite shuffle and sort in C++, for efficiency and
> memory control.
> We will do 1 and 2 first, then 3.
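>
> As a rough illustration of items 1 and 2 above, the sketch below runs a
> user-defined map over in-memory records, partitions the emitted pairs by
> key hash, sorts each partition by key, and then groups a sorted run for a
> user-defined reduce. Every name in it (KeyValue, MapFn, ReduceFn,
> runMapSide, runReduceSide, wordCountMap, wordCountReduce) is hypothetical
> and is not the HCE or PIPES interface. Spilling to disk, combining, and
> merging into file.out are left out, and std::hash only stands in for
> whatever partitioner is actually used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types for illustration only; not the HCE or PIPES API.
using KeyValue = std::pair<std::string, std::string>;
using Emit = std::function<void(const std::string&, const std::string&)>;
using MapFn = std::function<void(const std::string& record, const Emit& emit)>;
using ReduceFn = std::function<std::string(
    const std::string& key, const std::vector<std::string>& values)>;

// Map side: invoke the user map on every record, route each emitted pair to
// a partition by key hash, then sort every partition by key. A real task
// would write each sorted partition out as a spill instead of returning it.
std::vector<std::vector<KeyValue>> runMapSide(
    const std::vector<std::string>& records, const MapFn& map,
    std::size_t numPartitions) {
    assert(numPartitions > 0);
    std::vector<std::vector<KeyValue>> partitions(numPartitions);
    Emit emit = [&](const std::string& key, const std::string& value) {
        std::size_t p = std::hash<std::string>{}(key) % numPartitions;
        partitions[p].emplace_back(key, value);
    };
    for (const std::string& record : records) {
        map(record, emit);
    }
    for (std::vector<KeyValue>& part : partitions) {
        std::stable_sort(part.begin(), part.end(),
                         [](const KeyValue& a, const KeyValue& b) {
                             return a.first < b.first;
                         });
    }
    return partitions;
}

// Reduce side: walk key-sorted input, collect the values of each run of
// equal keys, and invoke the user reduce once per distinct key.
std::vector<KeyValue> runReduceSide(const std::vector<KeyValue>& sorted,
                                    const ReduceFn& reduce) {
    std::vector<KeyValue> output;
    std::size_t i = 0;
    while (i < sorted.size()) {
        const std::string& key = sorted[i].first;
        std::vector<std::string> values;
        std::size_t j = i;
        while (j < sorted.size() && sorted[j].first == key) {
            values.push_back(sorted[j].second);
            ++j;
        }
        output.emplace_back(key, reduce(key, values));
        i = j;
    }
    return output;
}

// Example user code: word count. The map emits (word, "1") for each
// whitespace-separated word; the reduce returns the number of values.
void wordCountMap(const std::string& record, const Emit& emit) {
    std::istringstream in(record);
    std::string word;
    while (in >> word) {
        emit(word, "1");
    }
}

std::string wordCountReduce(const std::string& /*key*/,
                            const std::vector<std::string>& values) {
    return std::to_string(values.size());
}
```

> With a single partition, runMapSide({"a b a", "b c"}, wordCountMap, 1)
> followed by runReduceSide(parts[0], wordCountReduce) yields the pairs
> (a, 2), (b, 2), (c, 1).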
> How does this differ from PIPES?
> 1 We will reuse most of the PIPES code.
> 2 We will go further than PIPES: nothing changes in scheduling and
> management, but everything changes in execution.
> *UPDATE:*
> A test version of HCE is now available at this link:
> http://docs.google.com/leaf?id=0B5xhnqH1558YZjcxZmI0NzEtODczMy00NmZiLWFkNjAtZGM1MjZkMmNkNWFk&hl=zh_CN&pli=1
> It is a full package that includes all of the Hadoop source code.
> Follow the attached "HCE InstallMenu.pdf" to build it and deploy it on your
> cluster.
> The attached "HCE Tutorial.pdf" walks you through writing your first HCE
> program and specifies the rest of the interface.
> The attached "HCE Performance Report.pdf" compares the performance of HCE
> with Java MapRed and Pipes.
> Any comments are welcome.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)