[
https://issues.apache.org/jira/browse/CRUNCH-192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13625844#comment-13625844
]
Micah Whitacre commented on CRUNCH-192:
---------------------------------------
If this is piling too much onto this bug I can log a separate one, but could
you also clarify the documentation on the appropriate use of values taken from
the Iterable inside a DoFn or MapFn?
As an example, we were bitten by a case where we stored the individual items
from the Iterable so that we could process them once all the values had been
read.
{code}
@Override
public Foo map(final Pair<Bar, Iterable<Bat>> input) {
  List<Bat> bats = ...;
  // Stores each element exactly as the Iterable hands it out, without copying.
  for (Bat b : input.second()) {
    bats.add(b);
  }
  return new Foo(bats);
}
{code}
When this is run during a reduce, the list bats ends up with a single
item instead of multiple items. For this to work properly we actually have to
make a copy of each item in the Iterable. Stating this behavior more clearly
in the javadoc would help consumers write their MapFn/DoFn correctly the
first time.
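To make the copy requirement concrete, here is a minimal, self-contained sketch that does not depend on Crunch or Hadoop. It assumes the cause is that the framework refills one value object on every call to next(); Bat, its copy() method, ReusingIterable and CopyDemo are all hypothetical stand-ins invented for this illustration, not Crunch classes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

final class CopyDemo {
    /** Keeps the references handed out by the Iterable, as in the map() above. */
    static List<Integer> collectWithoutCopy(Iterable<Bat> values) {
        List<Bat> bats = new ArrayList<>();
        for (Bat b : values) {
            bats.add(b); // every entry aliases the same reused instance
        }
        return ids(bats);
    }

    /** Copies each element before storing it, so later reuse cannot overwrite it. */
    static List<Integer> collectWithCopy(Iterable<Bat> values) {
        List<Bat> bats = new ArrayList<>();
        for (Bat b : values) {
            bats.add(b.copy());
        }
        return ids(bats);
    }

    private static List<Integer> ids(List<Bat> bats) {
        List<Integer> out = new ArrayList<>();
        for (Bat b : bats) {
            out.add(b.id);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collectWithoutCopy(new ReusingIterable(1, 2, 3))); // [3, 3, 3]
        System.out.println(collectWithCopy(new ReusingIterable(1, 2, 3)));    // [1, 2, 3]
    }
}

/** Hypothetical mutable value type (assumption: values are mutable and reusable). */
final class Bat {
    int id;

    Bat(int id) {
        this.id = id;
    }

    Bat copy() {
        return new Bat(id);
    }
}

/** Hypothetical Iterable that overwrites one shared Bat on every next(). */
final class ReusingIterable implements Iterable<Bat> {
    private final int[] ids;
    private final Bat shared = new Bat(-1);

    ReusingIterable(int... ids) {
        this.ids = ids;
    }

    @Override
    public Iterator<Bat> iterator() {
        return new Iterator<Bat>() {
            private int pos = 0;

            @Override
            public boolean hasNext() {
                return pos < ids.length;
            }

            @Override
            public Bat next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                shared.id = ids[pos++]; // mutate in place instead of allocating
                return shared;
            }
        };
    }
}
```

In this simulation the uncopied list holds three references to one object, so only the last value survives; how Crunch itself surfaces the problem may differ in detail.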
> Document and enforce the semantics around reducer-based Iterables
> -----------------------------------------------------------------
>
> Key: CRUNCH-192
> URL: https://issues.apache.org/jira/browse/CRUNCH-192
> Project: Crunch
> Issue Type: Bug
> Reporter: Gabriel Reid
> Attachments: CRUNCH-192.patch
>
>
> As reported on the mailing list by Chad Urso McDaniel:
> BLUF: The Iterable parameter to CombineFn.process implies you can iterate
> multiple times when you cannot and this leads to surprising behavior.
> As many of you probably know, the signature of CombineFn.process is
> ---
> process(Pair<K, Iterable<V>> input, Emitter<Pair<K, V>> emitter)
> ---
> The corresponding Hadoop Reducer signature is
> ---
> reduce(K2 key, Iterator<V2> values, OutputCollector<K3,V3> output, Reporter reporter)
> ---
> I assume the Crunch use of Iterable is for convenient use in "for" loops.
> Unfortunately, this Iterable appears to return the same Iterator object
> each time Iterable.iterator() is called.
> This makes sense to me given the underlying Hadoop MapReduce, but it violates
> what I think most people expect from the Iterable interface.
> I understand that it's too late to change the interface, but could we at
> least have a javadoc note, or an exception thrown if the Iterable is used
> more than once?
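One way the "exception thrown if the Iterable is used more than once" idea could look is sketched below. This is only an illustration under stated assumptions: SinglePassIterable and SinglePassDemo are hypothetical names, they are not part of Crunch, and this is not necessarily what the attached patch does.

```java
import java.util.Arrays;
import java.util.Iterator;

final class SinglePassDemo {
    /** Hypothetical wrapper that exposes a one-shot Iterator as an Iterable. */
    static final class SinglePassIterable<T> implements Iterable<T> {
        private final Iterator<T> underlying;
        private boolean consumed = false;

        SinglePassIterable(Iterator<T> underlying) {
            this.underlying = underlying;
        }

        @Override
        public Iterator<T> iterator() {
            if (consumed) {
                // Fail loudly rather than silently returning an exhausted Iterator.
                throw new IllegalStateException("this Iterable can only be iterated once");
            }
            consumed = true;
            return underlying;
        }
    }

    static int sum(Iterable<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        Iterable<Integer> values =
            new SinglePassIterable<>(Arrays.asList(1, 2, 3).iterator());
        System.out.println(sum(values)); // 6
        try {
            sum(values); // second pass over the same Iterable
        } catch (IllegalStateException e) {
            System.out.println("second pass rejected: " + e.getMessage());
        }
    }
}
```

Without such a guard, a second "for" loop over the same Iterable would simply see no elements, which is the surprising behavior described above.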
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira