Hi, I ran two examples of an MR execution with the same input files and with 3 Reduce tasks defined. One example has the map-intermediate files compressed, and the other example has uncompressed data. Below is the output of some debug lines that I added to the code.

1 - On the uncompressed data, the raw length is always smaller than the partition length, but on the compressed data it is not. Why is the raw length bigger than the partition length for compressed data?

2 - If we define the map-intermediate files as compressed, how are the map-intermediate files distributed to the reduces? Since a compressed file cannot be split, does this mean that each spill file is compressed? For example, Compressed(Spill idx 0) goes to Reduce 0, Compressed(Spill idx 1) goes to Reduce 1, and Compressed(Spill idx 2) goes to Reduce 2.

Compressed data

Spill idx 0 - SegmentStart: 0 Part length: 10560 Raw length: 27567
Spill idx 1 - SegmentStart: 10560 Part length: 10029 Raw length: 26003
Spill idx 2 - SegmentStart: 20589 Part length: 10142 Raw length: 26459

Spill idx 0 - SegmentStart: 0 Part length: 10202 Raw length: 26785
Spill idx 1 - SegmentStart: 10202 Part length: 9932 Raw length: 26100
Spill idx 2 - SegmentStart: 20134 Part length: 9926 Raw length: 25821

Spill idx 0 - SegmentStart: 0 Part length: 9410 Raw length: 24503
Spill idx 1 - SegmentStart: 9410 Part length: 9849 Raw length: 25564
Spill idx 2 - SegmentStart: 19259 Part length: 9489 Raw length: 24716

Spill idx 0 - SegmentStart: 0 Part length: 1661 Raw length: 3440
Spill idx 1 - SegmentStart: 1661 Part length: 1527 Raw length: 3160
Spill idx 2 - SegmentStart: 3188 Part length: 1737 Raw length: 3750

Non-compressed data

Spill idx 0 - SegmentStart: 0 Part length: 27571 Raw length: 27567
Spill idx 1 - SegmentStart: 27571 Part length: 26007 Raw length: 26003
Spill idx 2 - SegmentStart: 53578 Part length: 26463 Raw length: 26459

Spill idx 0 - SegmentStart: 0 Part length: 26789 Raw length: 26785
Spill idx 1 - SegmentStart: 26789 Part length: 26104 Raw length: 26100
Spill idx 2 - SegmentStart: 52893 Part length: 25825 Raw length: 25821

Spill idx 0 - SegmentStart: 0 Part length: 24507 Raw length: 24503
Spill idx 1 - SegmentStart: 24507 Part length: 25568 Raw length: 25564
Spill idx 2 - SegmentStart: 50075 Part length: 24720 Raw length: 24716

Spill idx 0 - SegmentStart: 0 Part length: 3444 Raw length: 3440
Spill idx 1 - SegmentStart: 3444 Part length: 3164 Raw length: 3160
Spill idx 2 - SegmentStart: 6608 Part length: 3754 Raw length: 3750

Thanks,
-- Pedro
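
P.S. In case it helps to compare the numbers, here is a minimal Python sketch that parses the debug lines above and reports, for each segment, the difference between the partition length and the raw length and their ratio. The function names (parse, summarize, is_contiguous) are hypothetical helpers made up for this illustration and are not part of Hadoop; the sample text is just the first group of each run copied from the output above.

```python
import re

# Matches one debug line in the format printed above.
LINE = re.compile(
    r"Spill idx (\d+) - SegmentStart: (\d+) Part length: (\d+) Raw length: (\d+)"
)

def parse(log_text):
    """Hypothetical helper: return (idx, segment_start, part_length, raw_length) tuples."""
    return [tuple(int(g) for g in m.groups()) for m in LINE.finditer(log_text)]

def summarize(log_text):
    """For each segment, report part - raw (in bytes) and part / raw."""
    return [
        {"idx": idx, "diff": part - raw, "ratio": part / raw}
        for idx, _start, part, raw in parse(log_text)
    ]

def is_contiguous(log_text):
    """Check that each segment starts where the previous one ended."""
    rows = parse(log_text)
    return all(
        prev[1] + prev[2] == cur[1] for prev, cur in zip(rows, rows[1:])
    )

# First group of each run, copied from the debug output above.
COMPRESSED = """
Spill idx 0 - SegmentStart: 0 Part length: 10560 Raw length: 27567
Spill idx 1 - SegmentStart: 10560 Part length: 10029 Raw length: 26003
Spill idx 2 - SegmentStart: 20589 Part length: 10142 Raw length: 26459
"""

UNCOMPRESSED = """
Spill idx 0 - SegmentStart: 0 Part length: 27571 Raw length: 27567
Spill idx 1 - SegmentStart: 27571 Part length: 26007 Raw length: 26003
Spill idx 2 - SegmentStart: 53578 Part length: 26463 Raw length: 26459
"""

if __name__ == "__main__":
    for name, text in (("compressed", COMPRESSED), ("uncompressed", UNCOMPRESSED)):
        print(name, "contiguous:", is_contiguous(text))
        for row in summarize(text):
            print("  idx", row["idx"], "diff", row["diff"], "ratio", round(row["ratio"], 3))
```

With these samples, the uncompressed segments all show the same small positive difference, while the compressed segments have a partition length well below the raw length, and in both runs each SegmentStart equals the previous SegmentStart plus the previous partition length.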