[
https://issues.apache.org/jira/browse/HADOOP-4143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12629908#action_12629908
]
chris.douglas edited comment on HADOOP-4143 at 9/10/08 11:48 AM:
-----------------------------------------------------------------
The performance motivation is mostly limited to "memcmp" types like Text and
BytesWritable. Since the partitioner is called from collect, when we still have
the deserialized record objects, the only remaining motivation would be to
support partitioners like the one used in the terasort example. I talked
offline with Owen about this, and he makes the case that a "MemComparable"
interface on the aforementioned types would probably be more than sufficient
for practical uses, more readable than a partitioner handling different/layered
length encodings, and a more general abstraction than this one.
The only remaining reason would be the aforementioned space/time tradeoff:
saving an int per record while adding a call to the partitioner for each
compare in the sort. Any improvement in running time this effected would
probably be noise at best, and likely inferior to simply tuning the
configuration.
I don't usually like "tagging" types, but a MemComparable interface would not
only cover every case this would, it could also help with RawComparator
implementations, table stores, etc. This issue was conceived as a way to avoid
such an interface, but it's clearly not an improvement on it and should
probably be closed.
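To make the tradeoff concrete, here is a minimal sketch of what the proposed raw partitioner could look like. The class and method names are assumptions for illustration, not Hadoop API; the serialized key arrives as a byte range, and this sketch simply hashes the bytes, whereas terasort would instead compare them against sampled split points.

```java
// Hypothetical sketch of a "raw" partitioner (names are assumptions,
// not part of the Hadoop API). It receives the serialized key bytes
// rather than the deserialized object emitted from the map.
public class RawBytesPartitioner {

    // Partition on the raw bytes with a simple byte-wise hash.
    // A terasort-style partitioner would instead binary-search the
    // byte range against sampled split points.
    public int getPartition(byte[] key, int offset, int length,
                            int numPartitions) {
        int hash = 1;
        for (int i = offset; i < offset + length; i++) {
            hash = 31 * hash + key[i];
        }
        // Mask the sign bit so the modulus is non-negative.
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        RawBytesPartitioner p = new RawBytesPartitioner();
        byte[] key = "example-key".getBytes();
        System.out.println(p.getPartition(key, 0, key.length, 4));
    }
}
```

Note the cost this comment weighs: if the sort compared via such a partitioner instead of storing a partition int per record, getPartition would run once per comparison rather than once per record.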
was (Author: chris.douglas): [previous revision differed only in closing with: "and should probably be closed as "Won't fix"."]
> Support for a "raw" Partitioner that partitions based on the serialized key
> and not record objects
> --------------------------------------------------------------------------------------------------
>
> Key: HADOOP-4143
> URL: https://issues.apache.org/jira/browse/HADOOP-4143
> Project: Hadoop Core
> Issue Type: Improvement
> Components: mapred
> Reporter: Chris Douglas
> Attachments: 4143-0.patch
>
>
> For some partitioners (particularly those using comparators to classify
> keys), it would be helpful if one could specify a "raw" partitioner that
> would receive the serialized version of the key rather than the object
> emitted from the map.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.