[
https://issues.apache.org/jira/browse/HADOOP-5299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12676541#action_12676541
]
eric baldeschwieler commented on HADOOP-5299:
---------------------------------------------
I don't think you want to try to store map outputs on reduce nodes. At least
not the sole copy.
First off, this means that when you lose a reduce node, you will need to rerun
every map to regenerate a fraction of its data. Game over.
Also, in our system we don't fix all reduces before starting M/R. One could
argue that this is a strength or a weakness, but changing it would be a big
deal.
Other tricks could be applied. The simplest:
- Concatenate the output of a job's maps on a node into a single file (or one
per slot). Then you only need one open per map node to shuffle, if you are
efficient at fetching (we are implementing a shuffle fix that fetches data this
way anyway).
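The concatenation trick above can be sketched roughly as follows. This is an
illustration only, not TaskTracker code: the file names, the offset index, and
the `concatenate` helper are all hypothetical, and a real implementation would
live inside the shuffle path.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;

// Hypothetical sketch: merge the per-map output files produced on one node
// into a single data file plus an offset index, so a reducer can fetch all
// of a node's map output with one open instead of one open per map.
public class ConcatMapOutputs {
    // Appends each map output file to 'combined' and records
    // {offset, length} in an index keyed by the original file name.
    public static Map<String, long[]> concatenate(List<Path> mapOutputs, Path combined)
            throws IOException {
        Map<String, long[]> index = new LinkedHashMap<>();
        long offset = 0;
        try (var out = Files.newOutputStream(combined,
                StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)) {
            for (Path p : mapOutputs) {
                byte[] data = Files.readAllBytes(p);
                out.write(data);
                index.put(p.getFileName().toString(), new long[]{offset, data.length});
                offset += data.length;
            }
        }
        return index;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("mapout");
        Path a = Files.write(dir.resolve("map_0.out"), "part-A".getBytes());
        Path b = Files.write(dir.resolve("map_1.out"), "part-B".getBytes());
        Path combined = dir.resolve("node.data");
        Map<String, long[]> index = concatenate(List.of(a, b), combined);
        // A reducer would seek to the recorded offset and read 'length' bytes.
        long[] loc = index.get("map_1.out");
        System.out.println(Files.size(combined));  // 12
        System.out.println(loc[0] + "," + loc[1]); // 6,6
    }
}
```

A fetcher that is handed the index can then serve every reducer's slice of the
node with one open and a seek per request, which is the saving the comment is
pointing at.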
----
Again, the larger problem we want to solve here is to simplify our management
of free space. By pushing this into HDFS, you get the TT out of the role of
being an intermediate data storage service. This has lots of advantages:
1) When a local disk fills up, HDFS can handle the problem.
2) Scheduling only needs to track global temporary storage per job, not per
node. This is much simpler.
3) We get to rip out the Jetty server from TT nodes, simplifying security,
management, etc.
4) Since all network traffic is now HDFS traffic, we can deal with hotspots
etc. once.
...
> Reducer inputs should be spilled to HDFS rather than local disk.
> ----------------------------------------------------------------
>
> Key: HADOOP-5299
> URL: https://issues.apache.org/jira/browse/HADOOP-5299
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Affects Versions: 0.19.0
> Environment: All
> Reporter: Milind Bhandarkar
>
> Currently, both map outputs and reduce inputs are stored on local disks of
> tasktrackers. (Un)availability of local disk space for intermediate data is
> seen as a major factor in job failures.
> The suggested solution is to store these intermediate data on HDFS (maybe
> with a replication factor of 1). However, the main blocker issue with that
> solution is that lots of temporary names (proportional to the total number of
> maps) can overwhelm the namenode, especially since the map outputs are
> typically small (most produce one block of output).
> Also, as we see in many applications, the map outputs can be estimated more
> accurately, and thus users can plan accordingly, based on available local
> disk space.
> However, the reduce input sizes can vary a lot, especially for skewed data
> (or because of bad partitioning.)
> So, I suggest that it makes more sense to keep map outputs on local disks,
> but the reduce inputs (when spilled from reducer memory) should go to HDFS.
> Adding a configuration variable to indicate the filesystem to be used for
> reduce-side spills would let us experiment and compare the efficiency of this
> new scheme.
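The configuration variable suggested above might look like the following in
mapred-site.xml. This is purely illustrative: the property name
`mapred.reduce.spill.fs` is a hypothetical example, not an existing Hadoop
option, and the namenode address is a placeholder.

```xml
<!-- Hypothetical property: direct reduce-side spills to HDFS instead of the
     local filesystem. Name and value are illustrative placeholders only. -->
<property>
  <name>mapred.reduce.spill.fs</name>
  <value>hdfs://namenode:8020/</value>
  <description>Filesystem used for reduce-side spills; defaulting to the
  local filesystem would preserve today's behavior for comparison.</description>
</property>
```

Making the filesystem pluggable this way would let the two schemes be compared
on the same cluster by flipping a single setting, which is the experiment the
description proposes.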
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.