[ https://issues.apache.org/jira/browse/MAPREDUCE-6923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16114083#comment-16114083 ]

Robert Schmidtke commented on MAPREDUCE-6923:
---------------------------------------------

I am using bytecode instrumentation to log every read and write request that 
goes through the core Java I/O classes. I do this for every JVM that is 
started (YARN, Map, Reduce, HDFS, ...), and I log statistics over the entire 
TeraSort run. The aggregated statistics agree to within 97-99% (for reads and 
writes, respectively) with what the underlying XFS file system counters 
report, so I assume my instrumentation is fairly accurate; it gives 1169 GiB 
for all YARN I/O. I see that some 1.5 GiB is spent reading the mapreduce jar 
files (in YARN), and another 1.2 GiB is spent reading jar files in 
/usr/lib/jvm. However, caching is most likely involved, so I am not sure how 
much I/O actually happened at this level.
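
For illustration, the instrumentation essentially boils down to byte counters 
wrapped around the core I/O calls. The following is only a sketch (the class 
and counter names are made up; it is not the actual agent):

{code:java}
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only (not the actual instrumentation agent):
// a wrapper that tallies the bytes delivered by read() calls, the same
// kind of counter the bytecode instrumentation attaches to the core
// Java I/O classes.
public class CountingInputStream extends FilterInputStream {

  private static final AtomicLong TOTAL_BYTES_READ = new AtomicLong();

  public CountingInputStream(InputStream in) {
    super(in);
  }

  @Override
  public int read() throws IOException {
    int b = super.read();
    if (b != -1) {
      TOTAL_BYTES_READ.incrementAndGet(); // single-byte read
    }
    return b;
  }

  @Override
  public int read(byte[] buf, int off, int len) throws IOException {
    int n = super.read(buf, off, len);
    if (n > 0) {
      TOTAL_BYTES_READ.addAndGet(n); // bulk read
    }
    return n;
  }

  public static long totalBytesRead() {
    return TOTAL_BYTES_READ.get();
  }
}
{code}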

I have instrumented reading zip and jar files separately. Over the course of 
all map tasks (TeraGen + TeraSort), my instrumentation gives a total of 
{{638 GiB / (2048 + 2048) = 159.5 MiB}} per mapper; for the reduce tasks it 
gives {{337 GiB / 2048 = 168.5 MiB}} per reducer. However, I wouldn't rely 
too much on these numbers, because if I added them to the regular I/O induced 
by reading/writing the input/output, shuffle and spill, my numbers would no 
longer agree with the XFS counters.
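
For reference, the per-task averages above are simply the totals divided by 
the respective task counts (a quick back-of-the-envelope check with the 
numbers from my setup):

{code:java}
// Back-of-the-envelope check of the averages quoted above; GiB and MiB are
// powers of two, so converting GiB to MiB is a factor of 1024.
public class PerTaskJarIo {
  public static void main(String[] args) {
    double perMapperMiB  = 638.0 * 1024 / (2048 + 2048); // TeraGen + TeraSort map tasks -> 159.5 MiB
    double perReducerMiB = 337.0 * 1024 / 2048;          // TeraSort reduce tasks        -> 168.5 MiB
    System.out.printf("per mapper: %.1f MiB, per reducer: %.1f MiB%n",
        perMapperMiB, perReducerMiB);
  }
}
{code}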

{quote}
On some installations I've seen the JVM load close to 400Mb of jar files for 
hadoop and its dependencies. Even on trunk my MapTask reads about 180Mb of jars 
atleast.
{quote}

Do you mean that YARN should exhibit this I/O, or would I see it in the map 
and reduce JVMs (as explained above)?

> YARN Shuffle I/O for small partitions
> -------------------------------------
>
>                 Key: MAPREDUCE-6923
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6923
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>         Environment: Observed in Hadoop 2.7.3 and above (judging from the 
> source code of future versions), and Ubuntu 16.04.
>            Reporter: Robert Schmidtke
>            Assignee: Robert Schmidtke
>         Attachments: MAPREDUCE-6923.00.patch
>
>
> When a job configuration results in small partitions read by each reducer 
> from each mapper (e.g. 65 kilobytes as in my setup: a 
> [TeraSort|https://github.com/apache/hadoop/blob/branch-2.7.3/hadoop-mapreduce-project/hadoop-mapreduce-examples/src/main/java/org/apache/hadoop/examples/terasort/TeraSort.java]
>  of 256 gigabytes using 2048 mappers and reducers each), and setting
> {code:xml}
> <property>
>   <name>mapreduce.shuffle.transferTo.allowed</name>
>   <value>false</value>
> </property>
> {code}
> then the default setting of
> {code:xml}
> <property>
>   <name>mapreduce.shuffle.transfer.buffer.size</name>
>   <value>131072</value>
> </property>
> {code}
> results in almost 100% overhead in reads during shuffle in YARN, because for 
> each 65K needed, 128K are read.
> I propose a fix in 
> [FadvisedFileRegion.java|https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-shuffle/src/main/java/org/apache/hadoop/mapred/FadvisedFileRegion.java#L114]
>  as follows:
> {code:java}
> ByteBuffer byteBuffer = ByteBuffer.allocate(Math.min(this.shuffleBufferSize, 
> trans > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) trans));
> {code}
> e.g. 
> [here|https://github.com/apache/hadoop/compare/branch-2.7.3...robert-schmidtke:adaptive-shuffle-buffer].
> This sets the shuffle buffer size to the minimum of the buffer size 
> specified in the configuration (128K by default) and the actual partition 
> size (65K on average in my setup). In my benchmarks this reduced the read 
> overhead in YARN from about 100% (255 additional gigabytes as described 
> above) down to about 18% (an additional 45 gigabytes). The runtime of the 
> job remained the same in my setup.
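> To illustrate the effect with the sizes from my setup (a standalone sketch, 
> not the patch itself; only the {{Math.min}} expression mirrors the proposed 
> change):
> {code:java}
> // Standalone illustration of the proposed buffer sizing, using the
> // default 128K shuffle buffer and a 65K partition as described above.
> public class ShuffleBufferSizing {
>   public static void main(String[] args) {
>     int shuffleBufferSize = 128 * 1024; // mapreduce.shuffle.transfer.buffer.size default
>     long trans = 65 * 1024;             // bytes to transfer for one small partition
>
>     int before = shuffleBufferSize;     // current behavior: always the full configured buffer
>     int after = Math.min(shuffleBufferSize,
>         trans > Integer.MAX_VALUE ? Integer.MAX_VALUE : (int) trans);
>
>     System.out.println("buffer without patch: " + before + " bytes"); // 131072
>     System.out.println("buffer with patch:    " + after + " bytes");  // 66560
>   }
> }
> {code}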


