[
https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629908#action_12629908
]
chris.douglas edited comment on HADOOP-4143 at 9/10/08 11:48 AM:
-----------------------------------------------------------------
The performance motivation is mostly limited to "memcmp" types like Text and
BytesWritable. Since the partitioner is called from collect, when we still have
the deserialized record objects, the only remaining motivation would be to
support partitioners like the one used in the terasort example. I talked
offline with Owen about this, and he makes the case that a "MemComparable"
interface on the aforementioned types would probably be more than sufficient
for practical uses, more readable than a partitioner handling different/layered
length encodings, and a more general abstraction than this one.
The only remaining reason would be the aforementioned space/time tradeoff:
saving an int per record while adding a call to the partitioner for each
compare in the sort. Any improvement in running time this effected would
probably be noise at best, and likely inferior to simply tuning the
configuration.
I don't usually like "tagging" types, but a MemComparable interface would not
only cover every case this would, it could also help with RawComparator
implementations, table stores, etc. This issue was conceived as a way to avoid
such an interface, but it's clearly not an improvement on it and should
probably be closed.
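To make the tradeoff concrete, here is a minimal sketch of what the proposed raw partitioner could look like. The class and method names are assumptions for illustration, not Hadoop API; the serialized key arrives as a byte range, and this sketch simply hashes the bytes, whereas terasort would instead compare them against sampled split points.

```java
// Hypothetical sketch of a "raw" partitioner (names are assumptions,
// not part of the Hadoop API). It receives the serialized key bytes
// rather than the deserialized object emitted from the map.
public class RawBytesPartitioner {

    // Partition on the raw bytes with a simple byte-wise hash.
    // A terasort-style partitioner would instead binary-search the
    // byte range against sampled split points.
    public int getPartition(byte[] key, int offset, int length,
                            int numPartitions) {
        int hash = 1;
        for (int i = offset; i < offset + length; i++) {
            hash = 31 * hash + key[i];
        }
        // Mask the sign bit so the modulus is non-negative.
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        RawBytesPartitioner p = new RawBytesPartitioner();
        byte[] key = "example-key".getBytes();
        System.out.println(p.getPartition(key, 0, key.length, 4));
    }
}
```

Note the cost this comment weighs: if the sort compared via such a partitioner instead of storing a partition int per record, getPartition would run once per comparison rather than once per record.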
was (Author: chris.douglas): [previous revision differed only in closing with: "and should probably be closed as "Won't fix"."]
> Support for a "raw" Partitioner that partitions based on the serialized key
> and not record objects
> --------------------------------------------------------------------------------------------------
>
> Key: HADOOP-4143
> URL: https://issues.apache.org/jira/browse/HADOOP-4143
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Chris Douglas
> Attachments: 4143-0.patch
>
>
> For some partitioners (particularly those using comparators to classify
> keys), it would be helpful if one could specify a "raw" partitioner that
> would receive the serialized version of the key rather than the object
> emitted from the map.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.