[jira] [Comment Edited] (MAPREDUCE-2841) Task level native optimization

Arun C Murthy (JIRA) Tue, 24 Jun 2014 01:32:29 -0700

    [ 
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14041816#comment-14041816
 ]


Arun C Murthy edited comment on MAPREDUCE-2841 at 6/24/14 8:30 AM:
-------------------------------------------------------------------

Todd,

bq. I agree that building a completely parallel C++ MR runtime is a much larger 
project that should not be part of Hadoop. 

I'm confused. There already exists a large amount of code on GitHub for the 
full task runtime. Is that abandoned? Are you saying there is no intention to 
contribute that to Hadoop, ever? Why would that be? Would that be a separate 
project?

With or without ABI, C++ is still a major problem w.r.t. different compiler 
versions, the different platforms we support etc. That is precisely why 
HADOOP-10388 chose to use pure C only. A similar switch would make me *much* 
more comfortable, aside from the disparity in skills in the Hadoop community. 

Furthermore, there are considerably more security issues which open up in C++ 
land, such as buffer overflows.

----

bq. I think the 75k you're counting may include the auto-generated shell 
scripts.

From the github:

{noformat}
$ find . -name '*.java' | xargs wc -l
   11988 total
$ find . -name '*.h' | xargs wc -l
   27269 total
$ find . -name '*.cc' | xargs wc -l
   26276 total
{noformat}
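
A minimal sketch of getting one combined count per extension, assuming a POSIX shell with find, cat, wc and tr available. `count_lines` is a hypothetical helper name, not something from the repository. Quoting the pattern keeps the shell from expanding it before find runs, and piping through cat avoids several partial "total" lines when xargs splits a long file list across multiple wc invocations.

```shell
# Sketch only: count_lines is a hypothetical helper, not part of the repository.
# Usage: count_lines DIR PATTERN
# Prints a single combined line count for all regular files under DIR
# whose names match PATTERN (for example '*.java').
count_lines() {
  # -exec cat {} + concatenates every match, so wc sees one stream and
  # reports one number; tr strips the padding some wc implementations add.
  find "$1" -type f -name "$2" -exec cat {} + | wc -l | tr -d ' '
}
```

Called as `count_lines . '*.java'`, `count_lines . '*.h'` and `count_lines . '*.cc'`, it would report one number per extension. For the figures quoted above, the three totals add up to 11988 + 27269 + 26276 = 65533 lines.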

Whether it's test or non-test, we are still importing a *lot* of code - code 
that the Hadoop community would then need to maintain.

----

bq. So, it's not a tiny import by any means, but for 2x improvement on terasort 
wallclock, my opinion is that the maintenance burden is worth it.

Todd, as we both know, there are many, many ways to get a 2x improvement on 
terasort... nor is that worth a lot in the real world outside of benchmarks. 

I'm sure we both would take 2x on Pig/Hive any day... *smile*

----

bq. As for importing to Tez, I don't think the community has generally agreed 
to EOL MapReduce

Regardless of whether or not we pull this into MR, it would be useful to pull 
it into Tez too - if Sean wants to do it. Let's not discourage them.

I'm sure we both agree that we want to see real-world workloads improve, and 
that Hive/Pig/Cascading etc. represent those workloads.

IAC, hopefully we can stop this meme that I'm trying to *preclude* you from 
doing anything regardless of my beliefs. IAC, we both realize MR is reasonably 
stable and won't get a lot of investment, and so do our employers:
http://vision.cloudera.com/mapreduce-spark/
http://hortonworks.com/hadoop/tez/

Essentially, you asked for feedback from the MapReduce community; and this is 
my honest feedback - as someone who has actively helped maintain this codebase 
for more than 8 years now. So, I'd appreciate it if we don't misinterpret each 
other's technical opinions and concerns during this discussion. Thanks in 
advance.

FTR: I'll restate my concerns about C++, roadmap for C++ runtime, 
maintainability, support for all of Hadoop (new security bugs, future security 
features, platforms etc.). 

Furthermore, this jira was opened nearly 3 years ago and only has sporadic 
bursts of activity - not a good sign for long-term maintainability.

I've stated my concerns; let's try to get through them by focusing on those 
aspects.

----

Finally, what is the concern you see with starting this as an incubator project 
and allowing folks to develop a community around it? We can certainly help on 
our end by making it easy for them to plug in via interfaces etc. 

Thanks.


was (Author: acmurthy):
Todd,

bq. I agree that building a completely parallel C++ MR runtime is a much larger 
project that should not be part of Hadoop. 

I'm confused. There already exists a large amount of code on GitHub for the 
full task runtime. Is that abandoned? Are you saying there is no intention to 
contribute that to Hadoop, ever? Why would that be? Would that be a separate 
project?

With or without ABI, C++ is still a major problem w.r.t. different compiler 
versions, the different platforms we support etc. That is precisely why 
HADOOP-10388 chose to use pure C only. A similar switch would make me *much* 
more comfortable, aside from the disparity in skills in the Hadoop community. 

Furthermore, there are considerably more security issues which open up in C++ 
land, such as buffer overflows.

----

bq. I think the 75k you're counting may include the auto-generated shell 
scripts.

From the github:

{noformat}
$ find . -name '*.java' | xargs wc -l
   11988 total
$ find . -name '*.h' | xargs wc -l
   27269 total
$ find . -name '*.cc' | xargs wc -l
   26276 total
{noformat}

Whether it's test or non-test, we are still importing a *lot* of code - code 
that the Hadoop community would then need to maintain.

----

bq. So, it's not a tiny import by any means, but for 2x improvement on terasort 
wallclock, my opinion is that the maintenance burden is worth it.

Todd, as we both know, there are many, many ways to get a 2x improvement on 
terasort... nor is that worth a lot in the real world outside of benchmarks. 

I'm sure we both would take 2x on Pig/Hive any day... *smile*

----

bq. As for importing to Tez, I don't think the community has generally agreed 
to EOL MapReduce

Regardless of whether or not we pull this into MR, it would be useful to pull 
it into Tez too - if Sean wants to do it. Let's not discourage them.

I'm sure we both agree that we want to see real-world workloads improve, and 
that Hive/Pig/Cascading etc. represent those workloads.

IAC, hopefully we can stop this meme that I'm trying to *preclude* you from 
doing anything regardless of my religious beliefs. IAC, we both realize MR is 
reasonably stable and won't get a lot of investment, and so do our employers:
http://vision.cloudera.com/mapreduce-spark/
http://hortonworks.com/hadoop/tez/

So, I'd appreciate it if we don't misinterpret each other's technical opinions 
and concerns during this discussion. Thanks.

FTR: I'll restate my concerns about C++, roadmap for C++ runtime, 
maintainability, support for all of Hadoop (security, platforms etc.). 

Furthermore, this jira was opened nearly 3 years ago and only has sporadic 
bursts of activity - not a good sign for long-term maintainability.

I've stated my concerns; let's try to get through them by focusing on those 
aspects.

----

Finally, what is the concern you see with starting this as an incubator project 
and allowing folks to develop a community around it? We can certainly help on 
our end by making it easy for them to plug in via interfaces etc. 

Thanks.

> Task level native optimization
> ------------------------------
>
>                 Key: MAPREDUCE-2841
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>         Environment: x86-64 Linux/Unix
>            Reporter: Binglin Chang
>            Assignee: Sean Zhong
>         Attachments: DESIGN.html, MAPREDUCE-2841.v1.patch, 
> MAPREDUCE-2841.v2.patch, dualpivot-0.patch, dualpivotv20-0.patch, 
> fb-shuffle.patch
>
>
> I've recently been working on native optimization for MapTask based on JNI. 
> The basic idea is to add a NativeMapOutputCollector to handle k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all 
> be done in native code. A preliminary test (on Xeon E5410, jdk6u24) showed 
> promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string compare is 
> supported)
> 2. IFile serialization speed is about 3x that of Java, about 500MB/s; if 
> hardware CRC32C is used, things can get much faster(1G/
> 3. Merge code is not completed yet, so the test uses enough io.sort.mb to 
> prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask, if 
> IdentityMapper (mapper does nothing) is used.
> There are limitations of course: currently only Text and BytesWritable are 
> supported, and I have not thought through many things yet, such as how to 
> support map-side combine. I had some discussion with somebody familiar with 
> Hive, and it seems that these limitations won't be much of a problem for Hive 
> to benefit from those optimizations, at least. Advice or discussion about 
> improving compatibility is most welcome :) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value type, comparator type, and combiner are 
> all compatible; if so, MapTask can choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be adopted for reduce 
> task and shuffle too. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
