Hadoop C++ Extension
--------------------
Key: MAPREDUCE-1270
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: task
Affects Versions: 0.20.1
Environment: Hadoop, Linux
Reporter: Wang Shouyan
Hadoop C++ Extension is an internal project at Baidu. We started it for the
following reasons:
1 To provide a C++ API. We have mostly used Streaming before, and we have
also tried PIPES, but we did not find PIPES to be more efficient than
Streaming. So we think a new C++ extension is needed.
2 Even when using PIPES or Streaming, it is hard to control the memory of the
Hadoop map/reduce Child JVM.
3 Reading, writing, and sorting TB/PB-scale data in Java is very costly, and
when using PIPES or Streaming, a pipe or socket is not an efficient way to
carry such huge volumes of data.
What we want to do:
1 We do not use the map/reduce Child JVM for any data processing; it only
prepares the environment, starts the C++ mapper, tells the mapper which split
it should deal with, and reads progress reports from the mapper until it
finishes. The mapper reads records, invokes the user-defined map function,
partitions the output, writes spills, combines, and merges them into file.out.
We think all of these operations can be done in C++ code.
2 The reducer is similar to the mapper: it is started after the sort
finishes, reads from the sorted files, invokes the user-defined reduce
function, and writes to the user-defined record writer (a sketch of the user
API follows this list).
3 We also intend to rewrite shuffle and sort in C++, for efficiency and
memory control.
We will do 1 and 2 first, then 3.
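
Since we plan to reuse most of the PIPES code (see below), the user-facing API
will probably look much like the existing Hadoop Pipes C++ API. The following
is only a rough sketch written against that existing Pipes API
(HadoopPipes::Mapper, HadoopPipes::Reducer, HadoopPipes::runTask, etc.); it is
not the final HCE interface, which may differ:

#include <string>
#include <vector>
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

// Word-count style example using the existing Pipes user API; the HCE user
// API is expected to be similar in shape, but is not defined by this sketch.
class WordCountMapper : public HadoopPipes::Mapper {
public:
  WordCountMapper(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    // Called once per input record; emit(key, value) feeds the
    // partition/spill/combine/merge pipeline described in item 1 above.
    std::vector<std::string> words =
        HadoopUtils::splitString(context.getInputValue(), " ");
    for (size_t i = 0; i < words.size(); ++i) {
      context.emit(words[i], "1");
    }
  }
};

class WordCountReducer : public HadoopPipes::Reducer {
public:
  WordCountReducer(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    // Called once per key after the sort; the result goes to the
    // record writer, as described in item 2 above.
    int sum = 0;
    while (context.nextValue()) {
      sum += HadoopUtils::toInt(context.getInputValue());
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

int main(int argc, char* argv[]) {
  // runTask drives the task loop; under HCE the whole loop, including
  // record reading, spilling, and merging, would run in this native
  // process instead of the Child JVM.
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
}
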
What's the difference from PIPES:
1 We will reuse most of the PIPES code.
2 But we will take it further: nothing changes in scheduling and management,
while everything changes in execution.