[ https://issues.apache.org/jira/browse/MAPREDUCE-2538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13061953#comment-13061953 ]

Devaraj K commented on MAPREDUCE-2538:
--------------------------------------

bq. Not sure about this patch – it seems it now can result in a partition file
that's "valid", but has fewer partitions than the user has specified, right?
If two or more consecutive partitions contain the same elements, then it will
write fewer keys than the user specified.
    If two consecutive partitions consist entirely of the same element, then it
will take one element from the first partition and skip the other, because it
cannot write the duplicate element.

Here we take the element just before the last one in the partition, as long as
it is not equal to the previously written element. If we took the last element
of the partition, there is a chance that all the elements in the next partition
equal it, and then we would not get any element from that partition.
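
As a rough sketch of that per-partition selection (my reading of the comment
above, not the code in the attached patch; the class and method names and the
use of a plain java.util.Comparator are made up for illustration):

import java.util.Comparator;

public class PartitionKeySketch {
  // Pick the split key for one partition samples[start, end) of the sorted
  // sample array: step back over the trailing run of elements equal to the
  // partition's last element, so the next partition can still contribute a
  // distinct key; if the whole partition holds one value, fall back to it.
  static <K> K pickKey(K[] samples, int start, int end, Comparator<K> cmp) {
    int idx = end - 1;
    while (idx > start && cmp.compare(samples[idx - 1], samples[end - 1]) == 0) {
      --idx;
    }
    return (idx > start) ? samples[idx - 1] : samples[end - 1];
  }
}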

Example 1:
If the sample output is ‘1, 2, 3, 4’, ‘5, 6, 7, 7’, ‘8, 9, 10, 11’,
‘11, 11, 11, 11’, ‘12, 13, 14, 15’ and the number of partitions is 5, then it
gives the result ‘3, 6, 10, 11, 14’ after applying the patch.



Example 2:
If the sample output is ‘1, 2, 3, 4’, ‘5, 6, 7, 7’, ‘8, 9, 10, 11’,
‘11, 11, 11, 11’, ‘11, 11, 11, 11’, ‘12, 13, 14, 15’ and the number of
partitions is 6, then it gives the result ‘3, 6, 10, 11, 14’ after applying the
patch. The number of elements is less than the number of partitions, because we
cannot take the duplicate element from the 5th partition.
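
A toy, self-contained reproduction of these two examples (the class and method
names are made up, and plain ints stand in for real keys; the actual sampler
works on serialized keys with a RawComparator, so this only illustrates the
rule described above):

import java.util.ArrayList;
import java.util.List;

public class DuplicateKeyExamples {

  // Apply the per-partition rule from the comment above, skipping a key when
  // it would duplicate the previously written one.
  static List<Integer> splitKeys(int[] samples, int numPartitions) {
    List<Integer> keys = new ArrayList<>();
    int partitionSize = samples.length / numPartitions;
    Integer lastWritten = null;
    for (int p = 0; p < numPartitions; p++) {
      int start = p * partitionSize;
      int end = start + partitionSize;
      // step back over the trailing run of duplicates of the last element
      int idx = end - 1;
      while (idx > start && samples[idx - 1] == samples[end - 1]) {
        --idx;
      }
      int candidate = (idx > start) ? samples[idx - 1] : samples[end - 1];
      // skip the key if it equals the previously written one
      if (lastWritten == null || candidate != lastWritten) {
        keys.add(candidate);
        lastWritten = candidate;
      }
    }
    return keys;
  }

  public static void main(String[] args) {
    int[] ex1 = {1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11,
                 11, 11, 11, 11, 12, 13, 14, 15};
    int[] ex2 = {1, 2, 3, 4, 5, 6, 7, 7, 8, 9, 10, 11,
                 11, 11, 11, 11, 11, 11, 11, 11, 12, 13, 14, 15};
    System.out.println(splitKeys(ex1, 5)); // [3, 6, 10, 11, 14]
    System.out.println(splitKeys(ex2, 6)); // [3, 6, 10, 11, 14]
  }
}

Both runs print [3, 6, 10, 11, 14]; in the second case only five keys come out
of six partitions, because the fifth partition's key is skipped as a duplicate.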


> InputSampler.writePartitionFile() may write duplicate keys
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-2538
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2538
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>         Environment: EMR.
>            Reporter: Michael White
>            Assignee: Devaraj K
>            Priority: Minor
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-2538.patch
>
>
> InputSampler.writePartitionFile() outputs the same key multiple times if the 
> input samples have enough of a given key to span multiple partitions.  There 
> is logic in the code that appears to try to avoid this, but seems incorrect:
> for(int i = 1; i < numPartitions; ++i) {
>   int k = Math.round(stepSize * i);
>   while (last >= k && comparator.compare(samples[last], samples[k]) == 0) {
>     ++k;
>   }
>   writer.append(samples[k], nullValue);
>   last = k;
> }
> The while loop condition "last >= k" is always false.  The sample comparison 
> after the && never occurs.
> It's not entirely clear what the correct fix is.  The current behavior is 
> arguably correct mathematically, though the while loop could be elided for 
> clarity.  If bug MAPREDUCE-1987 were fixed, it would be less of a problem 
> (for me at least), since that is where the non-uniqueness causes me problems.
> Alternatively, changing the while to:
> "if (last >= 0) {
>    while (comparator.compare(samples[last], samples[k]) >= 0) {"
> or, optimized for skipping over many duplicates (but arguably less clear):
> "if (last >= 0) {
>    while (last >= k || comparator.compare(samples[last], samples[k]) >= 0) {"
> would probably achieve what the original author intended.
> Perhaps the behavior could be selected by a parameter, e.g. "boolean unique".
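
For reference, the second suggested variant written out as a compilable loop (a
sketch only, not the attached patch; the bounds check on k is an added
assumption so the loop cannot run past the end of samples[]):

int last = -1;
for (int i = 1; i < numPartitions; ++i) {
  int k = Math.round(stepSize * i);
  if (last >= 0) {
    // advance k past samples that do not sort strictly after the last written
    // key, but never past the last sample
    while (k < samples.length - 1
        && (last >= k || comparator.compare(samples[last], samples[k]) >= 0)) {
      ++k;
    }
  }
  writer.append(samples[k], nullValue);
  last = k;
}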

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

