Are you asking how the reducer will know whether to uncompress or not? Maybe it checks config properties like mapreduce.map.output.compress; I am not sure though.
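To make the rawLength/partLength distinction concrete, here is a self-contained toy sketch. It uses plain java.util.zip rather than Hadoop's CompressionCodec machinery, and the class and method names (SpillLengthDemo, compressPartition, decompressPartition) are invented for illustration: the map side writes a compressed partition (partLength bytes), and the reduce side, told by the config that map output is compressed, inflates it back to rawLength bytes.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy model of a compressed spill; not Hadoop code.
public class SpillLengthDemo {

    // "Map side": compress one partition's bytes, as a codec-enabled
    // spill writer would. The output's length plays the role of
    // IndexRecord.partLength.
    static byte[] compressPartition(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // "Reduce side": knowing from the job config that the map output is
    // compressed, inflate the partLength bytes back to rawLength bytes.
    static byte[] decompressPartition(byte[] part) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(part);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] raw = "aaaaaaaaaaaaaaaabbbbbbbbbbbbbbbb".getBytes("UTF-8");
        byte[] part = compressPartition(raw);
        // These two numbers play the roles of IndexRecord.rawLength and
        // IndexRecord.partLength: the servlet ships partLength bytes,
        // and decompression restores rawLength bytes.
        System.out.println("rawLength=" + raw.length + " partLength=" + part.length);
        byte[] back = decompressPartition(part);
        System.out.println("round trip ok: " + java.util.Arrays.equals(raw, back));
    }
}
```

The point of the sketch: the shuffle only ever needs to move partLength bytes; rawLength matters to whoever finally decodes the stream, which is why the servlet loop below counts down from info.partLength.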
-Ravi

On 12/23/10 4:30 AM, "Pedro Costa" <psdc1...@gmail.com> wrote:

I know that, but what's confusing me is that ReduceTask#MapOutputServlet has:

[code]
// open the map-output file
mapOutputIn = rfs.open(mapOutputFileName);
// seek to the correct offset for the reduce
mapOutputIn.seek(info.startOffset);
long rem = info.partLength;
int len = mapOutputIn.read(buffer, 0, (int) Math.min(rem, MAX_BYTES_TO_READ));
while (rem > 0 && len >= 0) {
    rem -= len;
    try {
        shuffleMetrics.outputBytes(len);
        outStream.write(buffer, 0, len);
        outStream.flush();
    } catch (IOException ie) {
        isInputException = false;
        throw ie;
    }
    totalRead += len;
    len = mapOutputIn.read(buffer, 0, (int) Math.min(rem, MAX_BYTES_TO_READ));
}
[/code]

As you can see, this snippet uses "long rem = info.partLength;" instead of "long rem = info.rawLength;". This line makes me wonder: in the case of uncompressed data, how does the reduce know that it is being sent the raw data instead of compressed data, if the lengths in these two variables differ?

On Wed, Dec 22, 2010 at 10:49 PM, Ravi Gummadi <gr...@yahoo-inc.com> wrote:
> partLength is the compressed length, as the map output data could be compressed
> based on a config setting.
> rawLength is the uncompressed length.
>
> sortAndSpill() in MapTask.java records these as:
>
> rec.startOffset = segmentStart;
> rec.rawLength = writer.getRawLength();
> rec.partLength = writer.getCompressedLength();
>
> -Ravi
>
> On 12/23/10 3:52 AM, "Pedro Costa" <psdc1...@gmail.com> wrote:
>
> An index record contains 3 variables:
> startOffset, rawLength and partLength.
>
> What's the difference between a raw length and a partition length?
>
> On Wed, Dec 22, 2010 at 10:05 PM, Ravi Gummadi <gr...@yahoo-inc.com> wrote:
>> Each map task produces R partitions (as part of its output file) if the
>> number of reduce tasks for the job is R.
>> A reduce task asks the TaskTracker where the map ran for its input.
The TaskTracker
>> gives the corresponding partition in the map output file based on the
>> reduce task id. For example, the TaskTracker gives the k-th partition to
>> reduce task xxx_r_00000k.
>>
>> -Ravi
>>
>> On 12/23/10 3:24 AM, "Pedro Costa" <psdc1...@gmail.com> wrote:
>>
>> So, I conclude that a partition is defined by the offset.
>> But, for example, a map task produces 5 partitions. How does the reduce
>> know that it must fetch one of the 5 partitions? Where is this information?
>> This information is not given by the offset alone.
>>
>> On Wed, Dec 22, 2010 at 9:07 PM, Ravi Gummadi <gr...@yahoo-inc.com> wrote:
>>> Each map task will generate a single intermediate file (i.e., the map
>>> output file). This is obtained by merging multiple spills, if spills
>>> needed to happen.
>>>
>>> The index file gives the offset and length for each reducer.
>>> The offset is the position in the map output file where the input data
>>> for a particular reducer starts, and the length is the size of the data
>>> starting from that offset.
>>>
>>> -Ravi
>>>
>>> On 12/23/10 2:17 AM, "Pedro Costa" <psdc1...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> 1 - I would like to understand how a partition works in MapReduce.
>>> I know that MapReduce contains the IndexRecord class, which indicates
>>> the length of something. Is it the length of a partition or of a spill?
>>>
>>> 2 - In a large map output, can a partition be a set of spills, or is a
>>> spill simply the same thing as a partition?
>>>
>>> Thanks,
>>> --
>>> Pedro
>>
>> --
>> Pedro
>
> --
> Pedro

--
Pedro
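Putting the thread together, the index-file lookup that serves reduce task k its partition can be sketched as a toy, in-memory model. The three fields of IndexRecord mirror the ones discussed above (startOffset, rawLength, partLength); everything else, including the class and method names, is hypothetical and not Hadoop code. Since no codec is used here, rawLength equals partLength.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model of one map output file plus its index file.
public class IndexRecordDemo {

    static class IndexRecord {
        final long startOffset, rawLength, partLength;
        IndexRecord(long start, long raw, long part) {
            startOffset = start; rawLength = raw; partLength = part;
        }
    }

    static byte[] mapOutputFile = new byte[0];
    static final List<IndexRecord> index = new ArrayList<>();

    // "Map side": concatenate R partitions into one output file and
    // record an IndexRecord per partition. Without compression, the
    // written (part) length equals the raw length.
    static void writeMapOutput(byte[][] partitions) {
        index.clear();
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        long offset = 0;
        for (byte[] p : partitions) {
            index.add(new IndexRecord(offset, p.length, p.length));
            file.write(p, 0, p.length);
            offset += p.length;
        }
        mapOutputFile = file.toByteArray();
    }

    // "TaskTracker side": reduce task k asks for its input; serve the
    // k-th partition by seeking to startOffset and reading partLength
    // bytes (cf. "long rem = info.partLength" in MapOutputServlet).
    static byte[] readPartition(int k) {
        IndexRecord rec = index.get(k);
        byte[] out = new byte[(int) rec.partLength];
        System.arraycopy(mapOutputFile, (int) rec.startOffset, out, 0, out.length);
        return out;
    }

    public static void main(String[] args) {
        writeMapOutput(new byte[][] {
            "for-reduce-0".getBytes(),
            "for-reduce-1".getBytes(),
            "for-reduce-2".getBytes()
        });
        System.out.println(new String(readPartition(1)));  // prints for-reduce-1
    }
}
```

This also answers the "how does the reduce know which partition to fetch" question in miniature: the reduce's own task id selects the index entry, and the index entry (not the offset alone) pins down both where the partition starts and how long it is.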