[
https://issues.apache.org/jira/browse/HADOOP-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Devaraj Das updated HADOOP-3366:
--------------------------------
Attachment: 3366.1.patch
(An offline discussion led me to agree with the suggestion that we should not
have the file abstraction for the in-memory merge. The file streams add
overhead, which is not desirable in a performance-critical section.)
This half-done patch is up for a high-level review. It introduces a
ByteArrayManager that shuffle can use to store files as raw byte arrays instead
of as files in the ramfs. It also defines a merge routine that can merge a bunch
of such byte arrays. The remaining work, i.e., changing the shuffle code to use
the ByteArrayManager instead of the ramfs, depends on the patch for HADOOP-2095
(since that patch changes the layout of the intermediate sequence file). I'll
see what else can be done without that patch being available.
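To make the merge idea concrete for reviewers, here is a minimal sketch of a
k-way merge over sorted byte-array segments. It is not the code in the patch:
the class name InMemoryMergeSketch, the Cursor helper, the fixed 4-byte
big-endian length fields, and the unsigned byte-wise key comparison are all
assumptions made for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not the patch: merges sorted in-memory segments whose
// records are length-prefixed (4-byte key length, 4-byte value length; assumed).
class InMemoryMergeSketch {

    // Walks one segment record by record without copying keys or values.
    static final class Cursor {
        final ByteBuffer buf;
        int recStart, keyLen, valLen;

        Cursor(byte[] segment) { buf = ByteBuffer.wrap(segment); }

        // Loads the next record header; returns false at the end of the segment.
        boolean next() {
            if (buf.remaining() < 8) return false;
            recStart = buf.position();
            keyLen = buf.getInt();
            valLen = buf.getInt();
            buf.position(recStart + 8 + keyLen + valLen);
            return true;
        }

        int keyStart() { return recStart + 8; }
        int recLen() { return 8 + keyLen + valLen; }
    }

    // Lexicographic comparison of the current keys as unsigned bytes (assumed order).
    static int compareKeys(Cursor a, Cursor b) {
        byte[] x = a.buf.array(), y = b.buf.array();
        int n = Math.min(a.keyLen, b.keyLen);
        for (int i = 0; i < n; i++) {
            int d = (x[a.keyStart() + i] & 0xff) - (y[b.keyStart() + i] & 0xff);
            if (d != 0) return d;
        }
        return a.keyLen - b.keyLen;
    }

    // Merges segments that are each already sorted by key into one sorted segment.
    static byte[] merge(List<byte[]> segments) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>(InMemoryMergeSketch::compareKeys);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] s : segments) {
            Cursor c = new Cursor(s);
            if (c.next()) heap.add(c);
        }
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.write(c.buf.array(), c.recStart, c.recLen()); // copy the whole record
            if (c.next()) heap.add(c);                        // re-queue if more remain
        }
        return out.toByteArray();
    }

    // Test helper: builds a segment from alternating key/value strings.
    static byte[] segment(String... keyValuePairs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < keyValuePairs.length; i += 2) {
            byte[] k = keyValuePairs[i].getBytes(StandardCharsets.UTF_8);
            byte[] v = keyValuePairs[i + 1].getBytes(StandardCharsets.UTF_8);
            out.write(ByteBuffer.allocate(8).putInt(k.length).putInt(v.length).array(), 0, 8);
            out.write(k, 0, k.length);
            out.write(v, 0, v.length);
        }
        return out.toByteArray();
    }
}
```

The heap holds one cursor per segment, so each output record costs one
poll and at most one add, and no stream objects sit between the merge and the
raw bytes.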
By the way, I have written the patch assuming the layout
<key-len><val-len><key><value> (the difference w.r.t. the earlier proposed
layout is that the two lengths are adjacent). That makes parsing the byte
arrays simpler.
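As a rough illustration of why keeping the lengths together helps, the sketch
below walks a byte array in that layout: one header read yields both lengths,
so the key and value offsets follow by simple arithmetic. The class name
RecordLayoutSketch and the fixed 4-byte big-endian length encoding are
assumptions for the example; the patch may encode lengths differently.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the patch: writes and indexes records laid
// out as <key-len><val-len><key><value>, with assumed 4-byte big-endian lengths.
class RecordLayoutSketch {

    // Serializes parallel key/value arrays into one byte array.
    static byte[] encode(byte[][] keys, byte[][] values) {
        int size = 0;
        for (int i = 0; i < keys.length; i++) {
            size += 8 + keys[i].length + values[i].length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (int i = 0; i < keys.length; i++) {
            buf.putInt(keys[i].length);   // <key-len>
            buf.putInt(values[i].length); // <val-len>
            buf.put(keys[i]);             // <key>
            buf.put(values[i]);           // <value>
        }
        return buf.array();
    }

    // Returns {keyOffset, keyLen, valOffset, valLen} for every record.
    static List<int[]> index(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        List<int[]> out = new ArrayList<>();
        while (buf.remaining() >= 8) {
            int keyLen = buf.getInt();
            int valLen = buf.getInt();
            int keyOff = buf.position();  // key starts right after the header
            int valOff = keyOff + keyLen; // value starts right after the key
            out.add(new int[] {keyOff, keyLen, valOff, valLen});
            buf.position(valOff + valLen); // skip to the next record header
        }
        return out;
    }
}
```

With the earlier interleaved layout, the value length could only be read after
skipping the key, which takes two separate reads per record instead of one.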
> Shuffle/Merge improvements
> --------------------------
>
> Key: HADOOP-3366
> URL: https://issues.apache.org/jira/browse/HADOOP-3366
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Arun C Murthy
> Assignee: Arun C Murthy
> Fix For: 0.18.0
>
> Attachments: 3366.1.patch
>
>
> This is intended to be a meta-issue to track various improvements to
> shuffle/merge in the reducer.