[
https://issues.apache.org/jira/browse/HADOOP-3999?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kai Mosebach updated HADOOP-3999:
---------------------------------
Component/s: conf
benchmarks
Summary: Dynamic host configuration system (via node side plugins)
(was: Need to add host capabilites / abilities)
> Dynamic host configuration system (via node side plugins)
> ---------------------------------------------------------
>
> Key: HADOOP-3999
> URL: https://issues.apache.org/jira/browse/HADOOP-3999
> Project: Hadoop Core
> Issue Type: Improvement
> Components: benchmarks, conf, metrics
> Environment: Any
> Reporter: Kai Mosebach
>
> The MapReduce paradigm is limited to running jobs against the lowest
> common denominator of all nodes in the cluster.
> On the one hand this is desirable (cloud computing: throw simple jobs in,
> never mind which node runs them).
> On the other hand it limits the possibilities quite a lot. For instance,
> if data could or needed to be fed to a third-party interface such as
> MATLAB, R, or BioConductor, many more kinds of jobs could be solved via Hadoop.
> Furthermore, it could be interesting to know the OS, the architecture, and
> the performance of a node relative to the rest of the cluster
> (performance ranking).
> For example, if a sub-cluster of nodes with strong CPUs or a sub-cluster of
> nodes with very fast disk I/O were known, the job tracker could select those
> nodes according to a so-called job profile (e.g. "my job is compute-heavy" or
> "my job is disk-I/O-heavy"), which a developer can usually estimate in advance.
> To achieve this, node capabilities could be introduced and stored in the DFS,
> giving you
> a1.) basic information about each node (OS, architecture)
> a2.) more sophisticated information (additional software, path to the
> software, version)
> a3.) KPIs collected about the node (disk I/O, CPU power, memory)
> a4.) network throughput to neighboring hosts, which might allow generating a
> network performance map of the cluster
> (a rough sketch of such a node-side plugin follows below)
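> As a non-authoritative sketch, a node-side plugin could contribute such
> capabilities as simple key/value pairs. The NodeCapabilityPlugin interface,
> the StaticHostCapabilities class, and the property keys below are hypothetical
> illustrations for this proposal, not an existing Hadoop API:
>
>   // Hypothetical contract: each node-side plugin contributes key/value
>   // capability entries that the TaskTracker could publish into the DFS.
>   public interface NodeCapabilityPlugin {
>     java.util.Map<String, String> collectCapabilities();
>   }
>
>   // Example plugin covering a1/a2: basic OS/arch info plus one illustrative
>   // software entry (values would normally be probed or configured per node).
>   class StaticHostCapabilities implements NodeCapabilityPlugin {
>     public java.util.Map<String, String> collectCapabilities() {
>       java.util.Map<String, String> caps = new java.util.HashMap<String, String>();
>       caps.put("os.name", System.getProperty("os.name"));   // a1
>       caps.put("os.arch", System.getProperty("os.arch"));   // a1
>       caps.put("software.r.path", "/usr/bin/R");            // a2, illustrative
>       return caps;
>     }
>   }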
> This would allow you to
> b1.) generate jobs that have a profile (compute-intensive, disk-I/O
> intensive, network-I/O intensive)
> b2.) generate jobs that have software dependencies (run on Linux only, run on
> nodes with MATLAB only)
> b3.) generate a performance map of the cluster (sub-clusters of fast-disk
> nodes, sub-clusters of fast-CPU nodes, a network-speed relation map between
> nodes)
> (a sketch of how b1/b2 could be declared on the job side follows below)
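> As a hedged illustration of b1) and b2): the property names "job.profile" and
> "job.requires.software" below are made up for this sketch and are not existing
> configuration keys; only JobConf.set()/get() themselves are real API.
>
>   import org.apache.hadoop.mapred.JobConf;
>
>   public class JobProfileExample {
>     // The developer declares the job profile (b1) and a software
>     // dependency (b2) via hypothetical configuration properties.
>     public static JobConf configure(JobConf conf) {
>       conf.set("job.profile", "disk-io-intensive");
>       conf.set("job.requires.software", "r");
>       return conf;
>     }
>
>     // A capability-aware scheduler could then match the declared software
>     // requirement against the capabilities a node has published (see the
>     // plugin sketch above).
>     public static boolean nodeMatches(java.util.Map<String, String> nodeCaps,
>                                       JobConf conf) {
>       String required = conf.get("job.requires.software", "");
>       return required.length() == 0
>           || nodeCaps.containsKey("software." + required + ".path");
>     }
>   }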
> From step b3) you could then even derive statistical information which could
> in turn be fed into the DFS NameNode to decide whether to store data on
> fast-disk sub-clusters only (though that might need to be a tool outside of
> Hadoop core).
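> A small sketch of that b3)-style post-processing: ranking nodes by a published
> disk-throughput figure to identify a fast-disk sub-cluster. The capability key
> "kpi.disk.mb_per_sec" and the threshold are assumptions for illustration;
> feeding the result back into the NameNode would be a separate tool, as noted.
>
>   import java.util.ArrayList;
>   import java.util.List;
>   import java.util.Map;
>
>   public class FastDiskSubcluster {
>     // Selects the hosts whose published disk throughput (hypothetical
>     // "kpi.disk.mb_per_sec" capability) exceeds the given threshold.
>     public static List<String> select(Map<String, Map<String, String>> capsByHost,
>                                       double minMbPerSec) {
>       List<String> fast = new ArrayList<String>();
>       for (Map.Entry<String, Map<String, String>> e : capsByHost.entrySet()) {
>         String value = e.getValue().get("kpi.disk.mb_per_sec");
>         if (value != null && Double.parseDouble(value) >= minMbPerSec) {
>           fast.add(e.getKey());
>         }
>       }
>       return fast;
>     }
>   }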
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.