Hey all,

Looking at the stats for a number of our jobs, the amount of data the UI claims we've read from or written to HDFS is vastly larger than the amount of data that should be involved in the job. For instance, we have a job that combines small files into big files and operates on around 2TB worth of data. The output in HDFS (via hadoop dfs -du) matches the expected size, but the JobTracker UI claims we've read and written around 22TB of data!
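For reference, this is roughly how I've been cross-checking the numbers. The job id, output path, and counter group spelling below are placeholders from memory (they may differ by Hadoop version), so adjust for your setup:

    # On-disk size of the job's output directory -- comes back around 2TB, as expected.
    # (Output path is a placeholder.)
    hadoop dfs -du /user/bryan/combined-output

    # The same figures the JobTracker UI reports, pulled via the job CLI.
    # (Job id is a placeholder; the counter group name may vary by version.)
    hadoop job -counter job_200001010000_0001 FileSystemCounters HDFS_BYTES_READ
    hadoop job -counter job_200001010000_0001 FileSystemCounters HDFS_BYTES_WRITTEN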

By all accounts, Hadoop is actually *doing* the right thing - we're not observing excess reads or writes anywhere. However, this massive discrepancy makes the job stats essentially worthless for understanding IO in our jobs.

Does anyone know why there's such an enormous difference? Have others experienced this problem?

-Bryan
