Are you asking how the reducer will know whether to uncompress or not? Maybe it checks config properties like mapreduce.map.output.compress; I am not sure though.
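To make the rawLength/partLength distinction concrete, here is a self-contained toy sketch. It uses plain java.util.zip rather than Hadoop's CompressionCodec machinery, and the class and method names (SpillLengthDemo, compressPartition, decompressPartition) are invented for illustration: the map side writes a compressed partition (partLength bytes), and the reduce side, told by the config that map output is compressed, inflates it back to rawLength bytes.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Toy model of a compressed spill; not Hadoop code.
public class SpillLengthDemo {

    // "Map side": compress one partition's bytes, as a codec-enabled
    // spill writer would. The output's length plays the role of
    // IndexRecord.partLength.
    static byte[] compressPartition(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!deflater.finished()) {
            out.write(buf, 0, deflater.deflate(buf));
        }
        deflater.end();
        return out.toByteArray();
    }

    // "Reduce side": knowing from the job config that the map output is
    // compressed, inflate the partLength bytes back to rawLength bytes.
    static byte[] decompressPartition(byte[] part) throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(part);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[1024];
        while (!inflater.finished()) {
            out.write(buf, 0, inflater.inflate(buf));
        }
        inflater.end();
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] raw = "aaaaaaaaaaaaaaaabbbbbbbbbbbbbbbb".getBytes("UTF-8");
        byte[] part = compressPartition(raw);
        // These two numbers play the roles of IndexRecord.rawLength and
        // IndexRecord.partLength: the servlet ships partLength bytes,
        // and decompression restores rawLength bytes.
        System.out.println("rawLength=" + raw.length + " partLength=" + part.length);
        byte[] back = decompressPartition(part);
        System.out.println("round trip ok: " + java.util.Arrays.equals(raw, back));
    }
}
```

The point of the sketch: the shuffle only ever needs to move partLength bytes; rawLength matters to whoever finally decodes the stream, which is why the servlet loop below counts down from info.partLength.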
-Ravi

On 12/23/10 4:30 AM, "Pedro Costa" <psdc1...@gmail.com> wrote:

I know that, but what's confusing me is that ReduceTask#MapOutputServlet has:

[code]
// open the map-output file
mapOutputIn = rfs.open(mapOutputFileName);
// seek to the correct offset for the reduce
mapOutputIn.seek(info.startOffset);
long rem = info.partLength;
int len = mapOutputIn.read(buffer, 0, (int) Math.min(rem, MAX_BYTES_TO_READ));
while (rem > 0 && len >= 0) {
    rem -= len;
    try {
        shuffleMetrics.outputBytes(len);
        outStream.write(buffer, 0, len);
        outStream.flush();
    } catch (IOException ie) {
        isInputException = false;
        throw ie;
    }
    totalRead += len;
    len = mapOutputIn.read(buffer, 0, (int) Math.min(rem, MAX_BYTES_TO_READ));
}
[/code]

As you can see, this snippet uses "long rem = info.partLength;" instead of "long rem = info.rawLength;". This line makes me wonder: in the case of uncompressed data, how does the reduce know that it is being sent the raw data instead of compressed data, if the lengths in these two variables differ?

On Wed, Dec 22, 2010 at 10:49 PM, Ravi Gummadi <gr...@yahoo-inc.com> wrote:
> partLength is the compressed length, as the map output data could be compressed
> based on a config setting.
> rawLength is the uncompressed length.
>
> sortAndSpill() in MapTask.java records these as:
>
> rec.startOffset = segmentStart;
> rec.rawLength = writer.getRawLength();
> rec.partLength = writer.getCompressedLength();
>
> -Ravi
>
> On 12/23/10 3:52 AM, "Pedro Costa" <psdc1...@gmail.com> wrote:
>
> An index record contains 3 variables:
> startOffset, rawLength and partLength.
>
> What's the difference between a raw length and a partition length?
>
> On Wed, Dec 22, 2010 at 10:05 PM, Ravi Gummadi <gr...@yahoo-inc.com> wrote:
>> Each map task produces R partitions (as part of its output file) if the
>> number of reduce tasks for the job is R.
>> A reduce task asks the TaskTracker where the map ran for its input.
The TaskTracker
>> gives the corresponding partition in the map output file based on the
>> reduce task id. For example, the TaskTracker gives the k-th partition to
>> reduce task xxx_r_00000k.
>>
>> -Ravi
>>
>> On 12/23/10 3:24 AM, "Pedro Costa" <psdc1...@gmail.com> wrote:
>>
>> So, I conclude that a partition is defined by the offset.
>> But, for example, a map task produces 5 partitions. How does the reduce
>> know that it must fetch one of the 5 partitions? Where is this information?
>> This information is not given by the offset alone.
>>
>> On Wed, Dec 22, 2010 at 9:07 PM, Ravi Gummadi <gr...@yahoo-inc.com> wrote:
>>> Each map task will generate a single intermediate file (i.e., the map
>>> output file). This is obtained by merging multiple spills, if spills
>>> needed to happen.
>>>
>>> The index file gives the offset and length for each reducer.
>>> The offset is the position in the map output file where the input data
>>> for a particular reducer starts, and the length is the size of the data
>>> starting from that offset.
>>>
>>> -Ravi
>>>
>>> On 12/23/10 2:17 AM, "Pedro Costa" <psdc1...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> 1 - I would like to understand how a partition works in MapReduce.
>>> I know that MapReduce contains the IndexRecord class, which indicates
>>> the length of something. Is it the length of a partition or of a spill?
>>>
>>> 2 - In a large map output, can a partition be a set of spills, or is a
>>> spill simply the same thing as a partition?
>>>
>>> Thanks,
>>> --
>>> Pedro
>>
>> --
>> Pedro
>
> --
> Pedro

--
Pedro
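Putting the thread together, the index-file lookup that serves reduce task k its partition can be sketched as a toy, in-memory model. The three fields of IndexRecord mirror the ones discussed above (startOffset, rawLength, partLength); everything else, including the class and method names, is hypothetical and not Hadoop code. Since no codec is used here, rawLength equals partLength.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;

// Toy in-memory model of one map output file plus its index file.
public class IndexRecordDemo {

    static class IndexRecord {
        final long startOffset, rawLength, partLength;
        IndexRecord(long start, long raw, long part) {
            startOffset = start; rawLength = raw; partLength = part;
        }
    }

    static byte[] mapOutputFile = new byte[0];
    static final List<IndexRecord> index = new ArrayList<>();

    // "Map side": concatenate R partitions into one output file and
    // record an IndexRecord per partition. Without compression, the
    // written (part) length equals the raw length.
    static void writeMapOutput(byte[][] partitions) {
        index.clear();
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        long offset = 0;
        for (byte[] p : partitions) {
            index.add(new IndexRecord(offset, p.length, p.length));
            file.write(p, 0, p.length);
            offset += p.length;
        }
        mapOutputFile = file.toByteArray();
    }

    // "TaskTracker side": reduce task k asks for its input; serve the
    // k-th partition by seeking to startOffset and reading partLength
    // bytes (cf. "long rem = info.partLength" in MapOutputServlet).
    static byte[] readPartition(int k) {
        IndexRecord rec = index.get(k);
        byte[] out = new byte[(int) rec.partLength];
        System.arraycopy(mapOutputFile, (int) rec.startOffset, out, 0, out.length);
        return out;
    }

    public static void main(String[] args) {
        writeMapOutput(new byte[][] {
            "for-reduce-0".getBytes(),
            "for-reduce-1".getBytes(),
            "for-reduce-2".getBytes()
        });
        System.out.println(new String(readPartition(1)));  // prints for-reduce-1
    }
}
```

This also answers the "how does the reduce know which partition to fetch" question in miniature: the reduce's own task id selects the index entry, and the index entry (not the offset alone) pins down both where the partition starts and how long it is.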