[
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041816#comment-14041816
]
Arun C Murthy edited comment on MAPREDUCE-2841 at 6/24/14 8:30 AM:
-------------------------------------------------------------------
Todd,
bq. I agree that building a completely parallel C++ MR runtime is a much larger
project that should not be part of Hadoop.
I'm confused. There is already a large amount of code on the github for the
full task runtime. Is that abandoned? Are you saying there is no intention to
contribute that to Hadoop, ever? Why would that be? Would that be a separate
project?
With or without ABI, C++ still is a major problem w.r.t different compiler
versions, different platforms we support etc. That is precisely why
HADOOP-10388 chose to use pure-C only. A similar switch makes me *much* more
comfortable, aside from the disparity in skills in the Hadoop community.
Furthermore, there are considerably more security issues which open up in C++
land such as buffer overflow etc.
----
bq. I think the 75k you're counting may include the auto-generated shell
scripts.
From the github:
{noformat}
$ find . -name *.java | xargs wc -l
11988 total
$ find . -name *.h | xargs wc -l
27269 total
$ find . -name *.cc | xargs wc -l
26276 total
{noformat}
Whether it's test or non-test, we are still importing a *lot* of code - code
which the Hadoop community will then need to maintain?
----
bq. So, it's not a tiny import by any means, but for 2x improvement on terasort
wallclock, my opinion is that the maintenance burden is worth it.
Todd, as we both know, there are many, many ways to get 2x improvement on
terasort...
... nor is it worth a lot in the real world outside of benchmarks.
I'm sure we both would take 2x on Pig/Hive any day... *smile*
----
bq. As for importing to Tez, I don't think the community has generally agreed
to EOL MapReduce
Regardless of whether or not we pull this into MR, it would be useful to pull
it into Tez too - if Sean wants to do it. Let's not discourage them.
I'm sure we both agree, and want to see real world workloads improve and that
Hive/Pig/Cascading etc. represent that.
IAC, hopefully we can stop this meme that I'm trying to *preclude* you from
doing anything regardless of my beliefs. IAC, we both realize MR is reasonably
stable and won't get a lot of investment, and so do our employers:
http://vision.cloudera.com/mapreduce-spark/
http://hortonworks.com/hadoop/tez/
Essentially, you asked for feedback from the MapReduce community; and this is
my honest feedback - as someone who has actively helped maintain this codebase
for more than 8 years now. So, I'd appreciate it if we don't misinterpret each
other's technical opinions and concerns during this discussion. Thanks in
advance.
FTR: I'll restate my concerns about C++, roadmap for C++ runtime,
maintainability, support for all of Hadoop (new security bugs, future security
features, platforms etc.).
Furthermore, this jira was opened nearly 3 years ago and only has sporadic
bursts of activity - not a good sign for long-term maintainability.
I've stated my concerns, let's try to get through them by focussing on those
aspects.
----
Finally, what is the concern you see with starting this as an incubator project
and allowing folks to develop a community around it? We can certainly help on
our end by making it easy for them to plug in via interfaces etc.
Thanks.
was (Author: acmurthy):
Todd,
bq. I agree that building a completely parallel C++ MR runtime is a much larger
project that should not be part of Hadoop.
I'm confused. There is already a large amount of code on the github for the
full task runtime. Is that abandoned? Are you saying there is no intention to
contribute that to Hadoop, ever? Why would that be? Would that be a separate
project?
With or without ABI, C++ still is a major problem w.r.t different compiler
versions, different platforms we support etc. That is precisely why
HADOOP-10388 chose to use pure-C only. A similar switch makes me *much* more
comfortable, aside from the disparity in skills in the Hadoop community.
Furthermore, there are considerably more security issues which open up in C++
land such as buffer overflow etc.
----
bq. I think the 75k you're counting may include the auto-generated shell
scripts.
From the github:
{noformat}
$ find . -name *.java | xargs wc -l
11988 total
$ find . -name *.h | xargs wc -l
27269 total
$ find . -name *.cc | xargs wc -l
26276 total
{noformat}
Whether it's test or non-test, we are still importing a *lot* of code - code
which the Hadoop community will then need to maintain?
----
bq. So, it's not a tiny import by any means, but for 2x improvement on terasort
wallclock, my opinion is that the maintenance burden is worth it.
Todd, as we both know, there are many, many ways to get 2x improvement on
terasort...
... nor is it worth a lot in the real world outside of benchmarks.
I'm sure we both would take 2x on Pig/Hive any day... *smile*
----
bq. As for importing to Tez, I don't think the community has generally agreed
to EOL MapReduce
Regardless of whether or not we pull this into MR, it would be useful to pull
it into Tez too - if Sean wants to do it. Let's not discourage them.
I'm sure we both agree, and want to see real world workloads improve and that
Hive/Pig/Cascading etc. represent that.
IAC, hopefully we can stop this meme that I'm trying to *preclude* you from
doing anything regardless of my religious beliefs. IAC, we both realize MR is
reasonably stable and won't get a lot of investment, and so do our employers:
http://vision.cloudera.com/mapreduce-spark/
http://hortonworks.com/hadoop/tez/
So, I'd appreciate if we don't misinterpret each others' technical opinions and
concerns during this discussion. Thanks.
FTR: I'll restate my concerns about C++, roadmap for C++ runtime,
maintainability, support for all of Hadoop (security, platforms etc.).
Furthermore, this jira was opened nearly 3 years ago and only has sporadic
bursts of activity - not a good sign for long-term maintainability.
I've stated my concerns, let's try to get through them by focussing on those
aspects.
----
Finally, what is the concern you see with starting this as an incubator project
and allowing folks to develop a community around it? We can certainly help on
our end by making it easy for them to plug in via interfaces etc.
Thanks.
> Task level native optimization
> ------------------------------
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Environment: x86-64 Linux/Unix
> Reporter: Binglin Chang
> Assignee: Sean Zhong
> Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch,
> MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch,
> fb-shuffle.patch
>
>
> I have recently been working on native optimization for MapTask based on JNI.
> The basic idea is to add a NativeMapOutputCollector to handle k/v pairs
> emitted by the mapper, so that sort, spill, and IFile serialization can all be
> done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed
> promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string compare is
> supported)
> 2. IFile serialization speed is about 3x of java, about 500MB/s, if hardware
> CRC32C is used, things can get much faster(1G/
> 3. Merge code is not completed yet, so the test uses a large enough io.sort.mb
> to prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask, if
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations of course: currently only Text and BytesWritable are
> supported, and I have not thought through many things yet, such as how to
> support map-side combine. I had some discussion with somebody familiar with
> Hive, and it seems that these limitations won't be much of a problem for Hive
> to benefit from those optimizations, at least. Advice or discussion about
> improving compatibility is most welcome :)
> Currently NativeMapOutputCollector has a static method called canEnable(),
> which checks whether the key/value type, comparator type, and combiner are all
> compatible; if so, MapTask can choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better
> final results, and I believe similar optimizations can be adopted for the
> reduce task and shuffle too.
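As a rough illustration of the canEnable() check and the "binary string compare" mentioned in the description above, here is a minimal sketch. It is not taken from the patch: the class name NativeCollectorCheck, the parameter list, and the rule that a configured combiner disables the native path are all assumptions made for illustration, and types are passed as plain class-name strings so the example has no Hadoop dependency.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; the real NativeMapOutputCollector.canEnable() may differ.
class NativeCollectorCheck {
    // The description says only Text and BytesWritable are supported so far.
    static final Set<String> SUPPORTED_TYPES = new HashSet<>(Arrays.asList(
            "org.apache.hadoop.io.Text",
            "org.apache.hadoop.io.BytesWritable"));

    // Returns true only if every precondition for the native path holds.
    // Assumption: a job with a combiner falls back to the Java collector,
    // since map-side combine is described as not yet thought through.
    static boolean canEnable(String keyClass, String valueClass,
                             boolean usesByteOrderComparator, boolean hasCombiner) {
        return SUPPORTED_TYPES.contains(keyClass)
                && SUPPORTED_TYPES.contains(valueClass)
                && usesByteOrderComparator
                && !hasCombiner;
    }

    // Assumed meaning of "binary string compare": unsigned, byte-wise
    // lexicographic order, with a shorter prefix sorting first.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff; // mask to treat the byte as unsigned
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }
}
```

In this sketch a caller such as MapTask would consult canEnable() once per task and fall back to the existing Java collector whenever it returns false, which keeps unsupported jobs on the current code path.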