A task may read from more than one block. For example, in line-oriented
input, lines frequently cross block boundaries. And a block may be read
from more than one host. For example, if a datanode dies midway through
providing a block, the client will switch to using a different datanode.
So the mapping is not simple. This information is also not, as you
inferred, available to applications. Why do you need this? Do you have
a compelling reason?
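
The closest thing the public API does expose is FileSystem#getFileBlockLocations, which reports the candidate datanodes holding replicas of each block -- not which replica a task actually read. A minimal sketch (the input path is taken from the command line; running it requires a configured HDFS client):

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHosts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path(args[0]);               // an input file in HDFS
    FileStatus stat = fs.getFileStatus(p);
    // One BlockLocation per block, covering the whole file.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (int i = 0; i < blocks.length; i++) {
      // getHosts() lists every datanode holding a replica of this
      // block -- the candidates, not the one the reader chose.
      System.out.println("block " + i
          + " offset " + blocks[i].getOffset()
          + " hosts " + Arrays.toString(blocks[i].getHosts()));
    }
    fs.close();
  }
}
```

Even with this, the mapping from task to host stays fuzzy for the reasons above: a split can span blocks, and a read can fail over mid-block.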
Doug
James Cipar wrote:
Is there any way to determine which replica of each block is read by a
MapReduce program? I've been looking through the Hadoop code, and it
seems like it tries to hide those kinds of details from the higher-level
API. Ideally, I'd like the host the task was running on, the file name
and block number, and the host the block was read from.