[ https://issues.apache.org/jira/browse/HADOOP-5979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717622#action_12717622 ]
Klaas Bosteels commented on HADOOP-5979:
----------------------------------------

bq. But still, the command needs to have an idea of how many partitions there are, isn't it? Or maybe, you are saying that it's up to the command developer to assume a certain partition count and implement the command... I agree that it's simple but am not sure whether all use cases would be covered with this model..

Maybe it doesn't cover every possible use case, but I think it should cover the most common ones. In the case of streaming, it might be more important to implement something that's very simple and easy to use than to try to make things as general as possible. Personally, I don't think I have ever implemented a partitioner that couldn't be replaced by a command that outputs keys, which then get hashed to determine the partition number.

bq. What did you mean by "we wouldn't need any additional reading/writing logic" ? There is at least that much reading/writing as your code outlined, isn't it?

I meant that {{org.apache.hadoop.streaming.io.InputWriter}} and {{org.apache.hadoop.streaming.io.OutputReader}} wouldn't have to be extended in any way. Having said that, extending {{InputWriter}} and {{OutputReader}} is perfectly feasible, so if you think it's better to work with partition numbers directly, we could also implement something like:

{code}
public int getPartition(K2 key, V2 value, int numPartitions) {
  // Send the record to the partitioner command; the key is optional.
  if (!ignoreKey) {
    inWriter_.writeKey(key);
  }
  inWriter_.writeValue(value);
  // writeNumber() and readNumber() would be the additions to InputWriter and OutputReader.
  inWriter_.writeNumber(numPartitions);
  // The command replies with the partition number it chose.
  return outReader_.readNumber();
}
{code}

This would definitely be more flexible and might also be more efficient in certain cases, so maybe it is indeed preferable. A partitioner command would be a rather advanced feature anyway, so it's probably fine to expect a bit more effort from the people who use it and let the command determine the partition number directly.
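For comparison, the simpler model discussed above (the command only outputs a key, and the framework hashes that key to pick the partition) could be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical and not part of Streaming, and it uses a plain {{String}} where a real job would hash the serialized key type. The formula itself is the one used by Hadoop's default {{HashPartitioner}}.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not an existing Hadoop Streaming class.
class KeyHashPartitionSketch {

    // Map the key emitted by the partitioner command to a partition number.
    // Clearing the sign bit keeps the result non-negative even when
    // hashCode() is negative, as Hadoop's HashPartitioner does.
    static int partitionFor(String commandOutputKey, int numPartitions) {
        return (commandOutputKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Assumed example keys that a partitioner command might print.
        List<String> keysFromCommand = Arrays.asList("us", "eu", "asia", "us");
        int numPartitions = 4;
        for (String key : keysFromCommand) {
            System.out.println(key + " -> partition " + partitionFor(key, numPartitions));
        }
    }
}
```

With this model the command never needs to know {{numPartitions}}; equal keys always land in the same partition, which is the property most custom partitioners rely on.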
> Streaming partitioner should allow command, not just Java class
> ---------------------------------------------------------------
>
>                 Key: HADOOP-5979
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5979
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: contrib/streaming
>            Reporter: Klaas Bosteels
>
> Since HADOOP-4842 got committed, Streaming allows both commands and Java
> classes to be specified as mapper, reducer, and combiner, but the
> {{-partitioner}} option is still limited to Java classes only. Allowing
> commands to be specified as partitioner as well would greatly improve the
> flexibility of Streaming programs.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.