[ https://issues.apache.org/jira/browse/MAPREDUCE-2841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093291#comment-13093291 ]

Chris Douglas commented on MAPREDUCE-2841:
------------------------------------------

bq. If the Java impl uses a similar approach to the C++ one here, the only 
difference will be the language, right?

Yes, but the language difference includes other overheads (more below).

bq. Sorry, can you explain more about how the C++ impl can do a better job here 
for a predictable memory footprint? In the current Java impl, all records (no 
matter which reducer they are going to) are stored in a central byte array. In 
the C++ impl, on one map task, each reducer has one corresponding partition 
bucket which maintains its own memory buffer. From what I understand, one 
partition bucket is for one reducer, and all records going to that reducer from 
the current map task are stored there, and will be sorted and spilled from 
there.

Each partition bucket maintains its own memory buffer, so the memory consumed 
by the collection framework includes the unused space in all the partition 
buffers. I'm calling that, possibly imprecisely, internal fragmentation. The 
{{RawComparator}} interface also requires that keys be contiguous, which 
introduces further "waste" unless the partition's collection buffer is copied 
whenever it is expanded (as in 0.16; that expansion/copying overhead also hurts 
performance and makes memory usage hard to predict, because both the src and 
dst buffers exist simultaneously): a key partially serialized at the end of a 
slab must be realigned into a new slab. This happens only at the wrap point of 
the circular buffer in the current implementation, but would happen at the 
boundary of every partition collector chunk.
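To make the fragmentation argument concrete, here is a toy sketch (illustrative classes only, not Hadoop's actual {{MapOutputCollector}} code) of per-partition slabs: with skewed output, most slabs sit mostly empty, yet their unused tails still count against the collector's memory budget.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of per-partition collection buffers: each reducer partition
// gets its own fixed-size slab, and the unused tail of every slab is
// internal fragmentation charged against the collector's memory budget.
class PartitionBucket {
    final byte[] slab;
    int used = 0;

    PartitionBucket(int slabSize) { slab = new byte[slabSize]; }

    // Returns false when the record does not fit contiguously; a real
    // collector would then have to realign the partial key into a new
    // slab (the copying overhead described above).
    boolean append(byte[] record) {
        if (used + record.length > slab.length) return false;
        System.arraycopy(record, 0, slab, used, record.length);
        used += record.length;
        return true;
    }

    int wasted() { return slab.length - used; }
}

public class FragmentationDemo {
    public static void main(String[] args) {
        int slabSize = 64;
        int partitions = 4;
        List<PartitionBucket> buckets = new ArrayList<>();
        for (int i = 0; i < partitions; i++) buckets.add(new PartitionBucket(slabSize));

        // Skewed input: most records target partition 0.
        byte[] rec = new byte[10];
        for (int i = 0; i < 6; i++) buckets.get(0).append(rec);
        buckets.get(1).append(rec);

        int totalWasted = 0;
        for (PartitionBucket b : buckets) totalWasted += b.wasted();
        // 4 * 64 = 256 bytes allocated, but only 70 hold records; the
        // remaining 186 bytes are internal fragmentation.
        System.out.println("wasted bytes: " + totalWasted);
    }
}
```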

That internal fragmentation creates unused buffer space that "prematurely" 
triggers a spill to reclaim the memory. Allocating smaller slabs decreases 
internal fragmentation, but adds roughly 8 bytes of object-header overhead per 
slab and more GC cycles. In contrast, a single large allocation (like the 
current collection buffer) is made once and sits in the old generation. The 4 
byte overhead per record to track the partition is a space savings over slabs 
sized exactly to each record, which would require at least 8 bytes per record 
if naively implemented.
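The slab-size tradeoff can be sketched with back-of-the-envelope arithmetic (the numbers below are illustrative assumptions, not measurements): smaller slabs waste less space to half-empty tails but pay a per-slab object header, while one big slab pays a single header but can leave a large unused tail.

```java
// Rough model of per-partition buffering overhead: total object-header
// cost for all slabs allocated, plus an average half-slab of unused
// tail space. Header size and slab sizes are assumptions, not measured
// Hadoop figures.
public class SlabOverhead {
    static long overheadPerPartition(long bytesBuffered, int slabSize, int headerBytes) {
        long slabs = (bytesBuffered + slabSize - 1) / slabSize; // slabs allocated
        return slabs * headerBytes + slabSize / 2;              // headers + avg unused tail
    }

    public static void main(String[] args) {
        // 1 MB buffered per partition, 8-byte object headers assumed.
        System.out.println(overheadPerPartition(1 << 20, 4096, 8));    // many small slabs: 4096
        System.out.println(overheadPerPartition(1 << 20, 1 << 20, 8)); // one big slab: 524296
    }
}
```

The crossover is why neither extreme wins: tiny slabs trade fragmentation for header and GC overhead, while one slab per partition trades headers for large unused tails.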

The current implementation is oriented toward stuffing the most records into a 
precisely fixed amount of memory, and adopts a few assumptions: 1) one should 
spill as little as possible; 2) if spilling is required, at least don't block 
the mapper; 3) packing the most records into each spill favors MapTasks with 
combiners. If there are cases (we all acknowledge that there are) where 
spilling more often but _faster_ can compensate for that difference, then it's 
worth reexamining those assumptions.
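Assumption 2 above is what the soft spill threshold implements; a minimal sketch of that policy (field and class names here are illustrative, though io.sort.spill.percent is the real config knob): filling continues past a soft threshold that starts a background spill, and the mapper only blocks once the buffer is completely full.

```java
// Hedged sketch of the non-blocking spill trigger: crossing the soft
// threshold starts a background spill thread; only complete exhaustion
// of the buffer blocks the mapper. Names are illustrative.
public class SpillPolicy {
    final int bufferSize;      // e.g. io.sort.mb worth of bytes
    final double spillPercent; // e.g. io.sort.spill.percent = 0.80

    SpillPolicy(int bufferSize, double spillPercent) {
        this.bufferSize = bufferSize;
        this.spillPercent = spillPercent;
    }

    // Soft threshold: kick off a background spill, keep collecting.
    boolean shouldStartSpill(int bytesUsed) {
        return bytesUsed >= (int) (bufferSize * spillPercent);
    }

    // Hard limit: the mapper must block until the spill frees space.
    boolean mustBlockMapper(int bytesUsed) {
        return bytesUsed >= bufferSize;
    }
}
```

Spilling "more often but faster" would amount to lowering the effective threshold while making each spill cheaper, which is exactly the tradeoff under discussion.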

> Task level native optimization
> ------------------------------
>
>                 Key: MAPREDUCE-2841
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2841
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task
>         Environment: x86-64 Linux
>            Reporter: Binglin Chang
>            Assignee: Binglin Chang
>         Attachments: MAPREDUCE-2841.v1.patch, dualpivot-0.patch, 
> dualpivotv20-0.patch
>
>
> I'm currently working on native optimization for MapTask based on JNI. 
> The basic idea is to add a NativeMapOutputCollector to handle k/v pairs 
> emitted by the mapper, so that sort, spill, and IFile serialization can all 
> be done in native code. Preliminary tests (on Xeon E5410, jdk6u24) showed 
> promising results:
> 1. Sort is about 3x-10x as fast as Java (only binary string comparison is 
> supported)
> 2. IFile serialization is about 3x as fast as Java, about 500MB/s; if 
> hardware CRC32C is used, things get much faster (1G/s).
> 3. Merge code is not complete yet, so the tests used enough io.sort.mb to 
> prevent mid-spill
> This leads to a total speedup of 2x~3x for the whole MapTask when 
> IdentityMapper (a mapper that does nothing) is used.
> There are limitations of course: currently only Text and BytesWritable are 
> supported, and I have not thought through many things yet, such as how to 
> support map-side combine. I had some discussion with somebody familiar with 
> Hive, and it seems these limitations won't be much of a problem for Hive to 
> benefit from these optimizations, at least. Advice or discussion about 
> improving compatibility is most welcome :) 
> Currently NativeMapOutputCollector has a static method called canEnable(), 
> which checks whether the key/value types, comparator type, and combiner are 
> all compatible; MapTask can then choose to enable NativeMapOutputCollector.
> This is only a preliminary test; more work needs to be done. I expect better 
> final results, and I believe similar optimizations can be adopted for the 
> reduce task and shuffle too. 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
