Sort buffer size (io.sort.mb) is limited to < 2 GB
--------------------------------------------------

                 Key: MAPREDUCE-2308
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2308
             Project: Hadoop Map/Reduce
          Issue Type: Bug
    Affects Versions: 0.21.0, 0.20.2, 0.20.1
         Environment: Cloudera CDH3b3 (0.20.2+)
            Reporter: Jay Hacker
            Priority: Minor


I have MapReduce jobs that use a large amount of per-task memory, because the 
algorithm I'm using converges faster if more data is together on a node.  I 
have my JVM heap size set at 3200 MB, and if I use the popular rule of thumb 
that io.sort.mb should be ~70% of that, I get 2240 MB.  I rounded this down to 
2048 MB, but map tasks crash with:
{noformat}
java.io.IOException: Invalid "io.sort.mb": 2048
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:790)
        ...
{noformat}

MapTask.MapOutputBuffer implements its buffer as a byte[] of io.sort.mb 
megabytes, and it sanity-checks the size before allocating the array.  The 
problem is that Java arrays can't have more than 2^31 - 1 elements (even with 
a 64-bit JVM); this is a limitation of the Java language specification itself. 
As memory and data sizes grow, this would seem to be a crippling limitation of 
Java.
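
For reference, here is a paraphrased sketch of the check that rejects 2048, 
based on the 0.20-era MapTask$MapOutputBuffer constructor (not verbatim; the 
exact source may differ slightly between versions):
{noformat}
// Paraphrased from MapTask$MapOutputBuffer's constructor; not verbatim.
final int sortmb = job.getInt("io.sort.mb", 100);

// 0x7FF == 2047, so sortmb must fit in 11 bits.  The buffer size in bytes
// is computed as sortmb << 20, which must fit in a signed 32-bit int:
// 2048 << 20 == 2^31 overflows int, hence 2048 is rejected.
if ((sortmb & 0x7FF) != sortmb) {
  throw new IOException("Invalid \"io.sort.mb\": " + sortmb);
}

int maxMemUsage = sortmb << 20;          // buffer size in bytes
byte[] kvbuffer = new byte[maxMemUsage]; // one Java array: at most 2^31 - 1 elements
{noformat}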

It would be nice if this ceiling were documented, and if the error were issued 
sooner, e.g. at JobTracker startup when it reads the config.  Going forward, 
we may need to implement some array-of-arrays hack for large buffers (see the 
sketch below). :(
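
To make the idea concrete, here is a minimal, hypothetical chunked-buffer 
sketch (the class name and layout are mine, not Hadoop's): it splits one 
logical buffer into 1 GB byte[] chunks and indexes with a long.
{noformat}
// Hypothetical "array of arrays" buffer that sidesteps the 2^31 - 1
// element cap on a single Java array.  Illustration only, not Hadoop code.
public class ChunkedByteBuffer {
  private static final int CHUNK_SHIFT = 30;               // 1 GB chunks
  private static final int CHUNK_SIZE = 1 << CHUNK_SHIFT;
  private static final long CHUNK_MASK = CHUNK_SIZE - 1;

  private final byte[][] chunks;

  public ChunkedByteBuffer(long capacity) {
    int numChunks = (int) ((capacity + CHUNK_SIZE - 1) >>> CHUNK_SHIFT);
    chunks = new byte[numChunks][];
    for (int i = 0; i < numChunks; i++) {
      long remaining = capacity - ((long) i << CHUNK_SHIFT);
      chunks[i] = new byte[(int) Math.min(CHUNK_SIZE, remaining)];
    }
  }

  public byte get(long index) {
    return chunks[(int) (index >>> CHUNK_SHIFT)][(int) (index & CHUNK_MASK)];
  }

  public void put(long index, byte b) {
    chunks[(int) (index >>> CHUNK_SHIFT)][(int) (index & CHUNK_MASK)] = b;
  }
}
{noformat}
The downside is that every access pays an extra indirection and mask, which 
matters in a sort inner loop.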
