The number of bytes read can exceed the block size somewhat because a block 
rarely starts/ends exactly on a record (e.g. line) boundary. So the reader 
usually has to read a bit before and/or after the actual block boundary to 
correctly pick up all of the records the split is supposed to cover. If you 
look at your counters, it's not having to read all that much extra data: 
134678218 - 134217728 = 460490 bytes, i.e. roughly 450 KB of overshoot on a 
128 MB block, or about 0.3%.
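
For illustration, here is a minimal sketch of that behavior against a plain 
local file (using RandomAccessFile rather than an HDFS stream; the class and 
method names are made up, and this is not the actual LineRecordReader source). 
Every split except the first skips the partial line at its start, and every 
split reads one whole line past its end to finish the straddling record:

import java.io.IOException;
import java.io.RandomAccessFile;

public class SplitLineSketch {

    // Reads the records (lines) belonging to the split [start, end) of a file.
    static long readSplit(String path, long start, long end) throws IOException {
        long bytesRead = 0;
        try (RandomAccessFile in = new RandomAccessFile(path, "r")) {
            in.seek(start);
            // Every split except the first skips its partial first line;
            // the previous split's reader already consumed it by overshooting.
            if (start != 0 && in.readLine() != null) {
                bytesRead += in.getFilePointer() - start;
            }
            // Read whole lines until the position crosses the split end.
            // The last line may extend past 'end' -- that overshoot is why
            // HDFS_BYTES_READ comes out slightly larger than the block size.
            while (in.getFilePointer() < end) {
                long before = in.getFilePointer();
                if (in.readLine() == null) break;   // hit end of file
                bytesRead += in.getFilePointer() - before;
            }
        }
        return bytesRead;   // usually a bit more than (end - start)
    }
}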

--Aaron

-----------------------------------------------------------------
From: Virajith Jalaparti [mailto:virajit...@gmail.com]
Sent: Tuesday, July 12, 2011 3:21 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Lack of data locality in Hadoop-0.20.2

Could the non-data-local nature of the maps be due to the amount of HDFS data 
read by each map being greater than the HDFS block size? In the job I was 
running, the HDFS block size dfs.block.size was 134217728, the 
HDFS_BYTES_READ by the maps was 134678218, and FILE_BYTES_READ was 134698338.
So HDFS_BYTES_READ is greater than dfs.block.size. Does this imply that most 
of the map tasks will be non-local? Further, would Hadoop ensure that the map 
task is scheduled on the node that has the larger chunk of the data to be 
read by the task?

Thanks,
Virajith

On Tue, Jul 12, 2011 at 7:20 PM, Allen Wittenauer <a...@apache.org> wrote:

On Jul 12, 2011, at 10:27 AM, Virajith Jalaparti wrote:

> I agree that the scheduler has less leeway when the replication factor is
> 1. However, I would still expect the number of data-local tasks to be more
> than 10% even when the replication factor is 1.
       How did you load your data?

       Did you load it from outside the grid or from one of the datanodes?  If 
you loaded from one of the datanodes, you'll basically have no real locality, 
especially with a rep factor of 1: HDFS places the first (and, with rep 1, the 
only) replica on the local datanode, so the entire file ends up on the node 
you loaded it from, and maps scheduled anywhere else can't be data-local.
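
To put rough numbers on that, here is a toy simulation (all figures made up, 
chosen only so the output lines up with the ~10% locality seen here): with a 
rep factor of 1 and the file written from one datanode, every block lives on 
that single node, so only the map slots on that node can ever be data-local.

public class LocalitySim {
    public static void main(String[] args) {
        int nodes = 10;                 // assumption: a 10-node grid
        int mapsPerNode = 2;            // assumption: 2 map slots per node
        String writerNode = "node-0";   // the datanode the data was loaded from

        int total = nodes * mapsPerNode, dataLocal = 0;
        for (int n = 0; n < nodes; n++) {
            for (int m = 0; m < mapsPerNode; m++) {
                // A map is data-local only if it runs on the node holding its
                // block; with rep factor 1 that is always the writer node.
                if (("node-" + n).equals(writerNode)) dataLocal++;
            }
        }
        System.out.printf("data-local maps: %d / %d (%.0f%%)%n",
                dataLocal, total, 100.0 * dataLocal / total);
        // Prints: data-local maps: 2 / 20 (10%)
    }
}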

