[
https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13092596#comment-13092596
]
Chris Douglas commented on MAPREDUCE-2841:
------------------------------------------
{quote}Just an idea: what if memory-related configurations could be random
variables, with a mean & variance? Could this lead to better resource
utilization? A fixed memory bound always means applications will request more
memory than they really need. I think in many cases predictable memory control
is enough, rather than precise memory control, since the latter is
impractical. We can use some dynamic memory if it stays within a predictable
range, for example +/-20%, +/-30%, etc.{quote}
The fixed memory bound definitely causes resource waste. Not only will users
ask for more memory than they need (particularly since most applications are
not tightly tuned), but in our clusters, users will just as often request far
too little. Because tasks' memory management is uniformly specified within a
job, there isn't even an opportunity for the framework to adapt to skew.
The random memory config is an interesting idea, but failed tasks are a
regrettable and expensive waste. For pipelines with SLAs, "random" failures
will probably motivate users to jack up their memory requirements to match the
range (which, if configurable, seems to encode the same contract). The point
of the precise specification was to avoid OOMs; because collection happens
across a JNI boundary, a "relaxed" but predictable memory footprint could be
easier to deploy, assuming a hard limit in the native code to avoid swapping.
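To make that concrete, here is a minimal sketch of what a relaxed-but-bounded
buffer policy might look like on the Java side. All names and the slack
parameter are hypothetical illustrations, not anything in the patch:
{code:java}
// Hypothetical sketch of a "relaxed" memory bound: the task requests a buffer
// size as usual, the collector may drift within a configured slack (e.g.
// +/-20%), but a hard cap, enforced in native code, is never exceeded.
public final class RelaxedBufferPolicy {
  private final long requestedBytes; // user-configured size, e.g. io.sort.mb
  private final double slack;        // e.g. 0.2 for +/-20%
  private final long hardCapBytes;   // absolute limit, to avoid swapping

  public RelaxedBufferPolicy(long requestedBytes, double slack, long hardCapBytes) {
    this.requestedBytes = requestedBytes;
    this.slack = slack;
    this.hardCapBytes = hardCapBytes;
  }

  /** The most the native allocator may grow to. */
  public long maxBytes() {
    return Math.min((long) (requestedBytes * (1.0 + slack)), hardCapBytes);
  }

  /** The least it may shrink to under memory pressure; still predictable. */
  public long minBytes() {
    return (long) (requestedBytes * (1.0 - slack));
  }
}
{code}
The point is only that the slack is bounded on both sides: the allocator gains
some freedom, but the contract stays predictable because the hard cap is
enforced natively.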
Thanks for the detail on the collection data structures. That makes it much
easier to orient oneself in the code.
A few quick notes on your
[earlier|https://issues.apache.org/jira/browse/MAPREDUCE-2841?focusedCommentId=13086973&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13086973]
comment:
Adding the partition to each record was, again, done to make memory use more
predictable. The overhead in Java of tracking thousands of per-partition
buckets (many going unused) was worse than the per-record overhead,
particularly in large jobs. Further, user comparators are often horribly
inefficient, so the partition comparison and the related performance hit were
in the noise. The cache miss is real, but hard to reason about without leaving
the JVM.
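As a rough illustration of that trade-off, the per-record partition turns the
sort into a single (partition, key) comparison over one buffer. This is a
sketch only; the signature and offsets are hypothetical, though
WritableComparator.compareBytes() is the raw comparison Hadoop actually provides:
{code:java}
// Illustrative sketch: each record carries its partition id, so one buffer can
// be sorted into (partition, key) order without tracking thousands of
// per-partition buckets, many of which would go unused.
import org.apache.hadoop.io.WritableComparator;

public final class PartitionedRecordComparator {
  public static int compare(int partA, byte[] bufA, int keyOffA, int keyLenA,
                            int partB, byte[] bufB, int keyOffB, int keyLenB) {
    // Cheap integer comparison first: records bound for different partitions
    // never reach the (often slow) user key comparison at all.
    if (partA != partB) {
      return partA < partB ? -1 : 1;
    }
    // Within a partition, fall back to a raw, memcmp-style byte comparison.
    return WritableComparator.compareBytes(bufA, keyOffA, keyLenA,
                                           bufB, keyOffB, keyLenB);
  }
}
{code}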
The decorator-based stream is (or at least was) required by the serialization
interface. While the current patch only supports records with a known
serialized length, the contract for other types is more general. Probably too
general, but users with occasional several-hundred-MB records (written in
chunks) do exist. Supporting them in this implementation is not a critical use
case, since they can just use the existing collector. Restricting this
collector to memcmp types could also shift the burden of user comparators onto
the serialization frameworks, which is probably the best strategy. Which is to
say: obsoleting the existing collection framework doesn't require that this
one support all of its use cases, if some of them can be handled more
competently elsewhere. If its principal focus is performance, it may make
sense not to support inherently slow semantics.
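For context, the constraint comes from Hadoop's serialization API:
org.apache.hadoop.io.serializer.Serializer exposes only open(OutputStream) and
serialize(T), so the collector has to interpose a stream decorator to route
bytes into its buffer. A minimal sketch, with the backing array as a
hypothetical stand-in for the real (native-backed) collector storage:
{code:java}
// Why the decorator exists: serialized bytes can reach the collection buffer
// only through an OutputStream wrapper, since that is all the Serializer
// contract offers. The byte[] here is a hypothetical stand-in.
import java.io.IOException;
import java.io.OutputStream;

public class BufferBackedStream extends OutputStream {
  private final byte[] buffer; // stand-in for the collector's buffer
  private int pos;

  public BufferBackedStream(byte[] buffer) { this.buffer = buffer; }

  @Override
  public void write(int b) throws IOException {
    write(new byte[] { (byte) b }, 0, 1);
  }

  @Override
  public void write(byte[] b, int off, int len) throws IOException {
    if (pos + len > buffer.length) {
      throw new IOException("record larger than intermediate buffer");
    }
    System.arraycopy(b, off, buffer, pos, len);
    pos += len;
  }

  /** For known-length types, comparable against the declared length. */
  public int bytesWritten() { return pos; }
}
{code}
A serializer then writes through it (serializer.open(stream);
serializer.serialize(key);), and for known-length types bytesWritten() can be
validated against the declared serialized length.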
Which brings up a point: what is the scope of this JIRA? A full, native task
runtime is a formidable job. Even if it only supported memcmp key types, no
map-side combiner, no user-defined comparators, and records smaller than its
intermediate buffer, such an improvement would still cover a lot of user jobs.
It might make sense to commit that subset as optional functionality first, then
iterate based on feedback.
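The canEnable() hook mentioned in the description below already gives that
subset its gate; here is a hedged sketch of what the checks might look like
(canEnable() and the Text/BytesWritable restriction come from the issue, the
individual tests are otherwise hypothetical):
{code:java}
// Sketch of the opt-in gate: the native collector advertises the subset it
// supports, and MapTask falls back to the existing collector otherwise.
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

public final class NativeCollectorGate {
  public static boolean canEnable(JobConf job) {
    // memcmp ordering only: bail out if a user comparator is configured
    if (job.get("mapred.output.key.comparator.class") != null) {
      return false;
    }
    // no map-side combiner in the first cut (getCombinerClass() is null if unset)
    if (job.getCombinerClass() != null) {
      return false;
    }
    // only types with a memcmp-able, known-length serialization
    return isSupported(job.getMapOutputKeyClass())
        && isSupported(job.getMapOutputValueClass());
  }

  private static boolean isSupported(Class<?> c) {
    return c == Text.class || c == BytesWritable.class;
  }
}
{code}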
> Task level native optimization
> ------------------------------
>
> Key: MAPREDUCE-2841
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
> Project: Hadoop Map/Reduce
> Issue Type: Improvement
> Components: task
> Environment: x86-64 Linux
> Reporter: Binglin Chang
> Assignee: Binglin Chang
> Attachments: MAPREDUCE-2841.v1.patch, dualpivot-0.patch,
> dualpivotv20-0.patch
>
>
> I've recently been working on a native optimization for MapTask based on JNI.
> The basic idea is to add a NativeMapOutputCollector to handle the k/v pairs
> emitted by the mapper, so that sort, spill, and IFile serialization can all be
> done in native code. Preliminary tests (on a Xeon E5410, jdk6u24) showed
> promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is
> supported)
> 2. IFile serialization is about 3x as fast as Java, around 500MB/s; if
> hardware CRC32C is used, it can get much faster (1GB/s).
> 3. Merge code is not complete yet, so the tests used a large enough
> io.sort.mb to prevent mid-spills.
> This amounts to a total speedup of 2x~3x for the whole MapTask when
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations, of course: currently only Text and BytesWritable are
> supported, and I have not thought through many things yet, such as how to
> support map-side combine. I had some discussion with someone familiar with
> Hive; it seems these limitations won't be much of a problem for Hive to
> benefit from these optimizations, at least. Advice or discussion about
> improving compatibility is most welcome :)
> Currently NativeMapOutputCollector has a static method called canEnable(),
> which checks whether the key/value types, comparator, and combiner are all
> compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better
> final results, and I believe similar optimizations can be applied to the
> reduce task and shuffle too.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira