Gabriel Reid created CRUNCH-192:
-----------------------------------

             Summary: Document and enforce the semantics around reducer-based Iterables
                 Key: CRUNCH-192
                 URL: https://issues.apache.org/jira/browse/CRUNCH-192
             Project: Crunch
          Issue Type: Bug
            Reporter: Gabriel Reid


As reported on [email protected] by Chad Urso McDaniel:

BLUF: The Iterable parameter to CombineFn.process implies that you can iterate 
over it multiple times, but you cannot, and this leads to surprising behavior.

As many of you probably know, the signature of CombineFn.process is 
---
process(Pair<K, Iterable<V>> input, Emitter<Pair<K, V>> emitter)
---

The corresponding Hadoop Reducer signature is
---
reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
---

I assume the Crunch use of Iterable is for convenient use in "for" loops.

Unfortunately, this Iterable seems to return the same Iterator object each 
time Iterable.iterator() is called, so a second loop over it sees no values.

This makes sense to me given the underlying Hadoop MapReduce API, but it 
violates what I think most people expect from the Iterable interface.
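
To illustrate the behavior described above, here is a minimal sketch in plain 
Java. It does not use Crunch or Hadoop classes; SharedIteratorIterable and 
SharedIteratorDemo are hypothetical names, and the wrapper only assumes that 
the reducer-side Iterable hands back one shared Iterator, as reported.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for the reducer-side Iterable (not a Crunch class):
// every call to iterator() returns the same underlying Iterator.
class SharedIteratorIterable<V> implements Iterable<V> {
    private final Iterator<V> shared;

    SharedIteratorIterable(Iterator<V> shared) {
        this.shared = shared;
    }

    @Override
    public Iterator<V> iterator() {
        return shared; // same object on every call
    }
}

class SharedIteratorDemo {
    // Counts the elements seen in one pass of a for-each loop.
    static int countPass(Iterable<Integer> values) {
        int n = 0;
        for (Integer ignored : values) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(1, 2, 3);
        Iterable<Integer> values = new SharedIteratorIterable<>(data.iterator());
        System.out.println(countPass(values)); // first pass sees every element
        System.out.println(countPass(values)); // second pass sees none
    }
}
```

The second loop compiles and runs without error, which is what makes the 
problem easy to miss: the shared Iterator is already exhausted, so the loop 
body simply never executes.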

I understand that it's too late to change the interface, but could we at least 
have a Javadoc note, or an exception thrown if the Iterable is used more than 
once?
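
One way the requested exception could look is sketched below. This is an 
assumption about a possible fix, not the change actually made in Crunch; 
SingleUseIterable and SingleUseDemo are hypothetical names, and the sketch 
uses only plain Java.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical wrapper (not part of Crunch): permits exactly one call to
// iterator() and fails loudly on any later call.
class SingleUseIterable<V> implements Iterable<V> {
    private final Iterator<V> delegate;
    private boolean consumed = false;

    SingleUseIterable(Iterator<V> delegate) {
        this.delegate = delegate;
    }

    @Override
    public Iterator<V> iterator() {
        if (consumed) {
            throw new IllegalStateException(
                "This Iterable can only be iterated once");
        }
        consumed = true;
        return delegate;
    }
}

class SingleUseDemo {
    public static void main(String[] args) {
        Iterable<String> values =
            new SingleUseIterable<>(Arrays.asList("a", "b").iterator());

        for (String v : values) {
            System.out.println(v); // first pass works normally
        }

        try {
            values.iterator(); // second use is rejected
        } catch (IllegalStateException e) {
            System.out.println("Second iteration rejected: " + e.getMessage());
        }
    }
}
```

Failing fast turns a silent empty loop into an immediate, descriptive error, 
at the cost of breaking any existing code that calls iterator() twice.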

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
