[ https://issues.apache.org/jira/browse/MAPREDUCE-64?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Douglas updated MAPREDUCE-64:
-----------------------------------

    Attachment: M64-4.patch

Merged with trunk.

Thanks for running the coverage tool, Todd.

bq. Since we're using the Local Runner for these tests, it's all a single 
partition

TestMiniMRDFSSort is the only test I know of that uses multiple partitions. I 
hope that will change after MAPREDUCE-1050.

bq. Line 1097 ("if (bufindex + headbytelen < avail) {" in void reset()) is 
always true in our tests. We should get a test case to exercise the other half 
of this branch.

This is awkward to write reliably. The patch already adds a lot of very 
specific test code, and the path that is tested is the more aggressive one, 
so I'm OK leaving this for now.

bq. We don't currently run any tests with job.getCompressMapOutput returning true
bq. Line 1365 (kvstart >= kvend ternary in sortAndSpill) is always true.

Added cases to TestMapCollection.

bq. can you put in a small comment describing the synchronization policy for 
the various offsets? Those used to be volatile and now they're under a lock, so 
it should be good to note that in the code.

The patch, like this issue, adds a fair amount of documentation, including 
calling out some of the places where locking and visibility are non-trivial. 
The {{volatile}} modifier in the existing code was paranoid, if not redundant.
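
To illustrate why {{volatile}} can be redundant once the offsets are only 
touched under a lock, here is a minimal sketch. It is not the MapTask code: 
the class GuardedOffsets and its fields are hypothetical names chosen for 
this example. It relies only on the Java memory model guarantee that an 
unlock happens-before every later lock of the same lock.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical illustration only; not taken from the patch. */
public class GuardedOffsets {
  private final ReentrantLock lock = new ReentrantLock();

  // Guarded by 'lock'. Every read and write happens while holding it, so the
  // lock provides the visibility that volatile would otherwise be needed for.
  private int start;
  private int end;

  /** Publish a new pair of offsets atomically with respect to readers. */
  public void advance(int newStart, int newEnd) {
    lock.lock();
    try {
      start = newStart;
      end = newEnd;
    } finally {
      lock.unlock();
    }
  }

  /** Read both offsets under the lock so they are seen as a consistent pair. */
  public int pending() {
    lock.lock();
    try {
      return end - start;
    } finally {
      lock.unlock();
    }
  }
}
```

The caveat is that this holds only if no code path reads the fields without 
taking the lock; a single unguarded read brings the visibility question back.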

> Map-side sort is hampered by io.sort.record.percent
> ---------------------------------------------------
>
>                 Key: MAPREDUCE-64
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-64
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Arun C Murthy
>            Assignee: Chris Douglas
>         Attachments: M64-0.patch, M64-0i.png, M64-1.patch, M64-1i.png, 
> M64-2.patch, M64-2i.png, M64-3.patch, M64-4.patch
>
>
> Currently io.sort.record.percent is a fairly obscure, per-job configurable, 
> expert-level parameter which controls how much accounting space is available 
> for records in the map-side sort buffer (io.sort.mb). Typically values for 
> io.sort.mb (100) and io.sort.record.percent (0.05) imply that we can store 
> ~350,000 records in the buffer before necessitating a sort/combine/spill.
> However, for many applications which deal with small records, e.g. the 
> world-famous wordcount and its family, this implies we can only use 5-10% of 
> io.sort.mb, i.e. 5-10M, before we spill, in spite of having _much_ more 
> memory available in the sort buffer. The word-count, for example, results in 
> ~12 spills (given an HDFS block size of 64M). The presence of a combiner 
> exacerbates the problem by piling on serialization/deserialization of 
> records too...
> Sure, jobs can configure io.sort.record.percent, but it's tedious and 
> obscure; we really can do better by getting the framework to automagically 
> pick it by using all available memory (up to io.sort.mb) for either the data 
> or accounting.
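
The arithmetic behind the description can be sketched as follows. This is a 
rough model, not the collector's code: the class and method names are 
hypothetical, the 16 bytes of accounting per record is an assumption about 
the pre-patch layout, and the spill threshold (io.sort.spill.percent) is 
ignored for simplicity.

```java
public class SortBufferCapacity {
  // Assumption: bytes of accounting metadata kept per record before the patch.
  public static final int ACCOUNTING_BYTES_PER_RECORD = 16;

  /** Records that fit in the accounting share of the sort buffer. */
  public static long maxRecords(int ioSortMb, double recordPercent) {
    long totalBytes = (long) ioSortMb << 20;            // MB -> bytes
    long accountingBytes = (long) (totalBytes * recordPercent);
    return accountingBytes / ACCOUNTING_BYTES_PER_RECORD;
  }

  /**
   * Fraction of io.sort.mb holding record data at the moment either the
   * accounting space or the data space fills up, whichever comes first.
   */
  public static double dataFractionUsed(int ioSortMb, double recordPercent,
                                        int avgRecordBytes) {
    long totalBytes = (long) ioSortMb << 20;
    long dataBytes = totalBytes - (long) (totalBytes * recordPercent);
    long wanted = maxRecords(ioSortMb, recordPercent) * avgRecordBytes;
    return (double) Math.min(dataBytes, wanted) / totalBytes;
  }

  public static void main(String[] args) {
    // Defaults quoted in the issue: io.sort.mb=100, io.sort.record.percent=0.05
    System.out.println(maxRecords(100, 0.05));
    // Small records, e.g. roughly 16 serialized bytes each (illustrative value)
    System.out.println(dataFractionUsed(100, 0.05, 16));
  }
}
```

Under these assumptions the defaults give 327,680 records, in line with the 
~350,000 quoted above, and 16-byte records fill only about 5% of io.sort.mb 
with data before the accounting space runs out.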

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
