[
https://issues.apache.org/jira/browse/HADOOP-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Douglas updated HADOOP-3062:
----------------------------------
Attachment: 3062-0.patch
First draft.
Format:
{noformat}
<log4j schema including timestamp, etc.> src: <src IP>, dest: <dst IP>, bytes:
<bytes>, op: <op enum>, id: <DFSClient id|taskid>[, blockid: <block id>]
{noformat}
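As a rough illustration of the "some processing of each entry" that this format is meant to allow, here is a minimal sketch of a parser for one entry. It assumes the fields appear exactly in the order shown above and that the log4j prefix can be skipped by searching for "src: ". The function name parse_entry, the sample line, and the op value READ are hypothetical and not taken from the patch.

```python
import re

# Matches the trailing part of an entry in the format above. The log4j
# prefix is skipped by searching for "src: ". The id may be empty, as it
# is for replication writes, and blockid is optional.
ENTRY_RE = re.compile(
    r"src: (?P<src>[^,]+), dest: (?P<dest>[^,]+), bytes: (?P<bytes>\d+), "
    r"op: (?P<op>[^,]+), id: (?P<id>[^,\s]*)"
    r"(?:, blockid: (?P<blockid>\S+))?\s*$"
)


def parse_entry(line):
    """Return the fields of one entry as a dict, or None if the line does not match."""
    m = ENTRY_RE.search(line)
    if m is None:
        return None
    entry = m.groupdict()
    entry["bytes"] = int(entry["bytes"])
    return entry


if __name__ == "__main__":
    # Hypothetical sample line; real prefixes depend on the log4j layout in use.
    sample = ("2008-03-25 10:15:00,123 INFO DataNode: src: 10.0.0.1, "
              "dest: 10.0.1.2, bytes: 4096, op: READ, "
              "id: DFSClient_task_0001_m_000000_0, blockid: blk_123")
    print(parse_entry(sample))
```

Lines that do not follow the schema return None, so ordinary DataNode or ReduceTask log lines can be passed through the same function and ignored.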
The patch adds the DFSClient clientName to OP_READ_BLOCK and changes the String
in OP_WRITE_BLOCK from the path (which is unused) to the clientName. If this is
set to DFSClient_<taskid> in map and reduce tasks, tracing the output of a job
should be straightforward after some processing of each entry. Writes for
replications (where the clientName is "") are logged as they have been; the
logging in PacketResponder has been reformatted to fit the preceding schema. A
few known issues:
* The logging assumes the IP address is sufficient to distinguish a source,
particularly for writes and in the shuffle
* This logs to the DataNode and ReduceTask appenders; these entries should be
directed elsewhere and disabled by default
* In testing this, some read entries exhibited a strange property: the
source and destination match, but neither matches the DataNode on which the
entry is logged. I'm clearly missing something.
I tried tracing a few blocks and map outputs through the logs, and all of those
made sense. That said, as mentioned in the last bullet, not all of the entries
did.
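For what it is worth, a sketch of the kind of post-processing this is meant to enable: summing bytes per client id and per (source rack, destination rack) pair. It assumes entries follow the format above and, per the first known issue, that the IP address alone identifies a host. The rack_of mapping from IP to rack name is hypothetical and would have to come from the cluster's topology configuration; the function names are also made up for illustration.

```python
import re
from collections import defaultdict

# Only the fields needed for aggregation; any trailing blockid is ignored.
FIELD_RE = re.compile(
    r"src: ([^,]+), dest: ([^,]+), bytes: (\d+), op: ([^,]+), id: ([^,\s]*)"
)


def bytes_by_client(lines):
    """Sum bytes per (client id, op); lines that do not match are skipped."""
    totals = defaultdict(int)
    for line in lines:
        m = FIELD_RE.search(line)
        if m is None:
            continue
        _src, _dest, nbytes, op, client = m.groups()
        totals[(client, op)] += int(nbytes)
    return dict(totals)


def bytes_by_rack_pair(lines, rack_of):
    """Sum bytes per (source rack, destination rack).

    rack_of is a hypothetical dict mapping IP address to rack name;
    addresses missing from it are counted under "unknown".
    """
    totals = defaultdict(int)
    for line in lines:
        m = FIELD_RE.search(line)
        if m is None:
            continue
        src, dest, nbytes, _op, _client = m.groups()
        key = (rack_of.get(src, "unknown"), rack_of.get(dest, "unknown"))
        totals[key] += int(nbytes)
    return dict(totals)
```

Entries whose source and destination fall in the same rack end up under a key such as ("/rack0", "/rack0"), so intra-rack and cross-rack traffic can be separated from the same totals.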
> Need to capture the metrics for the network ios generate by dfs reads/writes
> and map/reduce shuffling and break them down by racks
> ------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HADOOP-3062
> URL: https://issues.apache.org/jira/browse/HADOOP-3062
> Project: Hadoop Core
> Issue Type: Improvement
> Components: metrics
> Reporter: Runping Qi
> Attachments: 3062-0.patch
>
>
> In order to better understand the relationship between Hadoop performance and
> network bandwidth, we need to know
> the aggregate traffic in a cluster and its breakdown by racks.
> With this data, we can determine whether network
> bandwidth is the bottleneck when certain jobs are running on a cluster.