[
https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629677#action_12629677
]
chris.douglas edited comment on HADOOP-4143 at 9/9/08 6:58 PM:
---------------------------------------------------------------
There are several advantages to this.
# Partitioners (capable of) working with binary data wouldn't need to define
ways of extracting bytes from particular key types
# Where defined, RawComparators can be used without serializing the key
# There should be no performance hit for supporting both types of partitioner
# Keys read from a binary reader (e.g. {{SequenceFileAsBinaryInputFormat}}) or
emitted as bytes from the map can use a partitioner that understands the
underlying type
# Partitioner is already an abstract class in HADOOP-1230 (see following
proposal)
And, of course, disadvantages:
# Results differ if the key holds state that determines its partition but is
not serialized
# Adds some complexity to {{MapTask}} with few use cases
# There's a (potential) naming issue with this proposal I'm not sure how to
resolve
Ideally, a partitioner could handle either raw or object data depending on its
configuration. Unfortunately, {{Partitioner}} is an interface, so adding a
{{boolean isRaw()}} method would break a *lot* of user code. That said, the
interface can be safely deprecated and replaced with an abstract class:
{code}
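public abstract class Partitioner<K,V> {
  // Object-based partitioning, as in the existing interface
  public abstract int getPartition(K key, V value, int numPartitions);
  // Raw partitioners override this to return true
  public boolean isRaw() { return false; }
  // Raw partitioning on the serialized key; unsupported by default
  public int getPartition(byte[] keyBytes, int offset, int length,
      int numPartitions) {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}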
{code}
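For illustration, a subclass opting in to raw mode might look like the
following sketch. {{RawPartitioner}} only mirrors the abstract class proposed
above (renamed so the example compiles on its own), and
{{RawHashPartitioner}} is a hypothetical subclass, not an existing Hadoop
class; the hash function is an arbitrary choice.

```java
// Stand-in for the proposed abstract class, renamed so the sketch is
// self-contained. Not an existing Hadoop type.
abstract class RawPartitioner<K, V> {
  public abstract int getPartition(K key, V value, int numPartitions);

  public boolean isRaw() { return false; }

  public int getPartition(byte[] keyBytes, int offset, int length,
      int numPartitions) {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}

// Hypothetical raw partitioner: hashes the serialized key bytes directly,
// so it never needs to know the key type.
class RawHashPartitioner<K, V> extends RawPartitioner<K, V> {
  @Override
  public boolean isRaw() { return true; }

  @Override
  public int getPartition(K key, V value, int numPartitions) {
    throw new UnsupportedOperationException(
        "Raw partitioner: use the byte[] overload");
  }

  @Override
  public int getPartition(byte[] keyBytes, int offset, int length,
      int numPartitions) {
    int hash = 1;
    for (int i = offset; i < offset + length; i++) {
      hash = 31 * hash + keyBytes[i];
    }
    // Clear the sign bit so the result always falls in [0, numPartitions).
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }
}
```

Only the bytes in {{[offset, offset + length)}} contribute to the result, so
the same serialized key yields the same partition wherever it sits in the
buffer.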
A wrapper around the deprecated interface would be trivial to write (it could
be a final class, etc.) and would provide backwards compatibility for a few
releases. To support records larger than the serialization buffer, there are
at least two solutions:
# Add a deserialization pass in the raw {{getPartition}} (blech)
# Call {{getPartition}} after serializing the key and before serializing the value.
For a sane set of partitioners, serializers, and value types, this should be
compatible with the existing semantics.
Modifications to {{MapTask}} would be minimal. A final boolean can determine
which overloaded {{getPartition}} is called, and the deprecated
{{Partitioner}} interface would be wrapped for backwards compatibility.
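A rough sketch of both pieces, the wrapper and the dispatch on a final
boolean, follows. All names here ({{OldPartitioner}}, {{NewPartitioner}},
{{PartitionerWrapper}}, {{PartitionDispatcher}}) are hypothetical stand-ins
chosen so the example compiles on its own; the real change would live in
{{MapTask}} and is not shown.

```java
// Stand-in for the deprecated interface (hypothetical name).
interface OldPartitioner<K, V> {
  int getPartition(K key, V value, int numPartitions);
}

// Stand-in for the proposed abstract class (hypothetical name).
abstract class NewPartitioner<K, V> {
  public abstract int getPartition(K key, V value, int numPartitions);

  public boolean isRaw() { return false; }

  public int getPartition(byte[] keyBytes, int offset, int length,
      int numPartitions) {
    throw new UnsupportedOperationException("Not a raw Partitioner");
  }
}

// Adapter so existing implementations of the old interface keep working.
// It never reports itself as raw, so only the object overload is called.
final class PartitionerWrapper<K, V> extends NewPartitioner<K, V> {
  private final OldPartitioner<K, V> delegate;

  PartitionerWrapper(OldPartitioner<K, V> delegate) {
    this.delegate = delegate;
  }

  @Override
  public int getPartition(K key, V value, int numPartitions) {
    return delegate.getPartition(key, value, numPartitions);
  }
}

// Hypothetical stand-in for the collection path: isRaw() is read once and
// cached in a final boolean that selects the overload for every record.
final class PartitionDispatcher<K, V> {
  private final NewPartitioner<K, V> partitioner;
  private final boolean raw;
  private final int numPartitions;

  PartitionDispatcher(NewPartitioner<K, V> partitioner, int numPartitions) {
    this.partitioner = partitioner;
    this.raw = partitioner.isRaw();
    this.numPartitions = numPartitions;
  }

  // keyBytes/offset/length describe the already-serialized key.
  int partitionFor(K key, V value, byte[] keyBytes, int offset, int length) {
    return raw
        ? partitioner.getPartition(keyBytes, offset, length, numPartitions)
        : partitioner.getPartition(key, value, numPartitions);
  }
}
```

Because the serialized key is an argument to {{partitionFor}}, this shape
assumes the key has been serialized before the partition is chosen, which
matches the second solution listed above.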
Thoughts? This doesn't solve a general problem, but it's useful for
partitioning relative to a set of keys and particularly helpful in keeping the
API to binary partitioners sane.
> Support for a "raw" Partitioner that partitions based on the serialized key
> and not record objects
> --------------------------------------------------------------------------------------------------
>
> Key: HADOOP-4143
> URL: https://issues.apache.org/jira/browse/HADOOP-4143
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Chris Douglas
>
> For some partitioners (particularly those using comparators to classify
> keys), it would be helpful if one could specify a "raw" partitioner that
> would receive the serialized version of the key rather than the object
> emitted from the map.