[ 
https://issues.apache.org/jira/browse/HADOOP-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717622#action_12717622
 ] 

Klaas Bosteels commented on HADOOP-5979:
----------------------------------------

bq. But still, the command needs to have an idea of how many partitions there 
are, isn't it? Or maybe, you are saying that it's up to the command developer 
to assume a certain partition count and implement the command... I agree that 
it's simple but am not sure whether all use cases would be covered with this 
model..

It may not cover every possible use case, but I think it covers the most 
common ones, and for streaming it is arguably more important to offer 
something simple and easy to use than to make things as general as possible. 
Personally, I don't think I have ever implemented a partitioner that couldn't 
be replaced by a command that outputs keys, which are then hashed to determine 
the partition number. 
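
To make the hashing step concrete, here is a minimal sketch of how a key emitted by the command could be turned into a partition number. The class and method names are hypothetical and not part of any patch; the arithmetic is the same as in Hadoop's {{HashPartitioner}}, and I'm assuming the command's output key arrives as a plain string.

```java
// Sketch only: HashedKeyPartition and partitionFor are hypothetical names.
final class HashedKeyPartition {
    private HashedKeyPartition() {}

    /**
     * Maps a key emitted by a partitioner command to a partition number.
     * Masking with Integer.MAX_VALUE clears the sign bit, so the result is
     * never negative even when hashCode() is.
     */
    static int partitionFor(String commandOutputKey, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        return (commandOutputKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

With this model the command never needs to know the partition count: it only has to emit equal keys for records that belong together.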

bq. What did you mean by "we wouldn't need any additional reading/writing 
logic"? There is at least that much reading/writing as your code outlined, 
isn't it?

I meant that {{org.apache.hadoop.streaming.io.InputWriter}} and 
{{org.apache.hadoop.streaming.io.OutputReader}} wouldn't have to be extended in 
any way.

Having said that, extending {{InputWriter}} and {{OutputReader}} is perfectly 
feasible, so if you think it's better to work with partition numbers directly 
we could also implement something like:
{code}
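// Sketch: writeNumber() and readNumber() are the proposed extensions to
// InputWriter and OutputReader; they do not exist yet.
public int getPartition(K2 key, V2 value, int numPartitions) {
  try {
    if (!ignoreKey) {
      inWriter_.writeKey(key);
    }
    inWriter_.writeValue(value);
    inWriter_.writeNumber(numPartitions);
    return outReader_.readNumber();
  } catch (IOException e) {
    // Partitioner.getPartition() declares no checked exceptions.
    throw new RuntimeException(e);
  }
}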
{code}
This would definitely be more flexible and might also be more efficient in 
certain cases, so it may indeed be preferable. A partitioner command would be 
a rather advanced feature anyway, so it is probably fine to expect a bit more 
effort from the people who use it and let the command determine the partition 
number directly.
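
As a rough illustration of what such a command might look like, here is a sketch written in Java for consistency with the snippet above. Everything about the wire format is an assumption: I'm pretending each request arrives as one line of the form key, value, and partition count separated by tabs, and that the command answers with one partition number per line. The class name and the "partition on the key prefix before the first dot" rule are made up for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

// Sketch only: hypothetical command and hypothetical line format
// "key<TAB>value<TAB>numPartitions" in, one partition number per line out.
final class PrefixPartitionCommand {
    private PrefixPartitionCommand() {}

    /** Example policy: keys sharing the prefix before the first '.' share a partition. */
    static int choosePartition(String key, int numPartitions) {
        int dot = key.indexOf('.');
        String prefix = dot < 0 ? key : key.substring(0, dot);
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** Answers every request line read from 'in' with a partition number on 'out'. */
    static void run(Reader in, Writer out) {
        try {
            BufferedReader reader = new BufferedReader(in);
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split("\t", -1);
                if (fields.length < 2) {
                    throw new IllegalArgumentException("malformed request: " + line);
                }
                // The partition count is assumed to be the last field.
                int numPartitions = Integer.parseInt(fields[fields.length - 1]);
                out.write(choosePartition(fields[0], numPartitions) + "\n");
                // Flush per record: the caller blocks until it reads the reply.
                out.flush();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A real command would wrap {{System.in}} and {{System.out}} and call {{run}} from its {{main}} method. The per-record flush matters because the Java side would wait for each answer before sending the next record.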

> Streaming partitioner should allow command, not just Java class
> ---------------------------------------------------------------
>
>                 Key: HADOOP-5979
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5979
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Klaas Bosteels
>
> Since HADOOP-4842 got committed, Streaming allows both commands and Java 
> classes to be specified as mapper, reducer, and combiner, but the 
> {{-partitioner}} option is still limited to Java classes only. Allowing 
> commands to be specified as partitioner as well would greatly improve the 
> flexibility of Streaming programs.
