[ 
https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629677#action_12629677
 ] 

Chris Douglas commented on HADOOP-4143:
---------------------------------------

There are several advantages to this.
# Partitioners (capable of) working with binary data wouldn't need to define 
ways of extracting bytes from particular keytypes
# Where defined, RawComparators can be used without serializing the key
# Should be no performance hit for supporting both types of partitioner
# Keys read from a binary reader (e.g. {{SequenceFileAsBinaryInputFormat}}) or 
emitted as bytes from the map can use a partitioner that understands the 
underlying type
# Partitioner is already an abstract class in HADOOP-1230 (see following 
proposal)

And, of course, disadvantages
# Different results if the key contains state determining its partition that is 
not serialized
# Adds some complexity to {{MapTask}} with few use cases
# There's a (potential) naming issue with this proposal I'm not sure how to 
resolve

Ideally, a partitioner could handle either raw or object data depending on its 
configuration. Unfortunately, {{Partitioner}} is an interface, so adding a 
{{boolean isRaw()}} method would break a *lot* of user code. That said, the 
interface can be safely deprecated and replaced with an abstract class:
{code}
public abstract class Partitioner<K,V> {
  public abstract getPartition(K key, V value, int numPartitions);
  public boolean isRaw() { return false; }
  public getPartition(byte[] keyBytes, int offset, int length, int 
numPartitions) throws IOException {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}
{code}

A wrapper would be trivial to write, could be a final class, etc. providing 
backwards compatibility for a few releases. In support of records larger than 
the serialization buffer, there are at least two solutions:
# Add a deserialization pass in the raw {{getPartition}} (blech)
# Call getPartition after serializing the key and before serializing the value. 
For a sane set of partitioners, serializers, and value types, this should be 
compatible with the existing semantics.

Modifications to {{MapTask}} would be minimal. A final bool can determine which 
overloaded {{getPartition}} is called and there would be the 
backwards-compatible wrapping of the deprecated Partitioner interface.

Thoughts? This doesn't solve a general problem, but it's useful for 
partitioning relative to a set of keys and particularly helpful in keeping the 
API to binary partitioners sane.

> Support for a "raw" Partitioner that partitions based on the serialized key 
> and not record objects
> --------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-4143
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4143
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Chris Douglas
>
> For some partitioners (particularly those using comparators to classify 
> keys), it would be helpful if one could specify a "raw" partitioner that 
> would receive the serialized version of the key rather than the object 
> emitted from the map.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to