There is no compression in the mix for us, so that's not the culprit.
I'd be sort of willing to believe that spilling and sorting play a
role in this, but, wow, over 10x read and write? That seems like a
big problem.
-Bryan
On Mar 17, 2009, at 6:46 PM, Stefan Will wrote:
Some of the discrepancy could be due to compression of the map input/output
format. E.g., the mapper output bytes will show the compressed size, while
the reduce input will show the uncompressed size, or something along those
lines. But I'm also questioning the accuracy of the reporting and suspect
that some of it is due to all the disk activity that happens while
processing spills in the mapper and the copy/shuffle/sort phase in the
reducer. It would certainly be nice if all the byte counts were reported in
a way that makes them comparable.
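To give a rough sense of how spills alone can multiply the counters (a back-of-envelope sketch, not Hadoop's exact accounting): if each map-side spill is written to local disk once, and every additional merge pass rewrites the surviving data, the local byte counters legitimately exceed the logical data size by several multiples. The spill count and sort factor below are hypothetical illustration values, not numbers taken from any real job.

```python
import math

def estimated_local_bytes_written(data_bytes, num_spills, sort_factor):
    """Rough model of map-side local disk writes: one write for the
    initial spills, plus one full rewrite of the data per merge pass
    needed to combine num_spills files, merging sort_factor at a time."""
    if num_spills <= 1:
        return data_bytes  # a single spill needs no merge passes
    merge_passes = math.ceil(math.log(num_spills) / math.log(sort_factor))
    return data_bytes * (1 + merge_passes)

# Hypothetical job: 2 TB of map output, 100 spills, io.sort.factor = 10
tb = 2 * 1024**4
print(estimated_local_bytes_written(tb, 100, 10) / tb)
```

That already gives 3x on the map side alone; add the reduce-side shuffle-to-disk writes and reduce-side merges on top, and high multiples of the logical data size in the byte counters start to look plausible.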
-- Stefan
From: Bryan Duxbury <br...@rapleaf.com>
Reply-To: <core-user@hadoop.apache.org>
Date: Tue, 17 Mar 2009 17:26:52 -0700
To: <core-user@hadoop.apache.org>
Subject: Massive discrepancies in job's bytes written/read
Hey all,
In looking at the stats for a number of our jobs, the amount of data
that the UI claims we've read from or written to HDFS is vastly
larger than the amount of data that should be involved in the job.
For instance, we have a job that combines small files into big files,
operating on around 2TB worth of data. The output size in HDFS
(via hadoop dfs -du) matches the expected size, but the jobtracker UI
claims that we've read and written around 22TB of data!
By all accounts, Hadoop is actually *doing* the right thing - we're
not observing excess data reading or writing anywhere. However, this
massive discrepancy makes the job stats essentially worthless for
understanding IO in our jobs.
Does anyone know why there's such an enormous difference? Have others
experienced this problem?
-Bryan