[ 
https://issues.apache.org/jira/browse/CRUNCH-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Wills updated CRUNCH-494:
------------------------------
    Attachment: CRUNCH-494.patch

I'm guessing that you're iterating through some data set of unknown length 
and repeatedly unioning each newly created PCollection with the result of 
the previous union, so something like:

PCollection unioned = ...;
while (someCondition) {
  // each pass wraps the previous union in one more layer
  unioned = unioned.union(newPCollection);
}

...and if you do that enough times, the chain of unions gets really deep, 
hence the stack overflow. I'm not sure I can easily change the way the union 
chaining works without altering other behavior, but it's pretty easy to add 
Pipeline.union methods (one for PCollection, one for PTable), as I did in the 
attached patch. They let you create a List<PCollection<S>> and pass it to 
Pipeline.union to get a single, unioned PCollection<S> that won't have the 
stack overflow problem.
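
To illustrate why pairwise chaining gets deep while a single list-based union 
stays shallow, here is a minimal sketch. The Node class and the chained/flat 
helpers are hypothetical stand-ins, not Crunch's classes or the patch itself; 
the recursive depth() walk only loosely mirrors the repeated 
getTargetDependencies frames in the reported trace.

```java
import java.util.ArrayList;
import java.util.List;

public class UnionDepthSketch {
    /** Hypothetical stand-in for a collection node; NOT Crunch's PCollectionImpl. */
    static final class Node {
        final List<Node> parents;

        Node(List<Node> parents) {
            this.parents = parents;
        }

        /** Recursive walk over parents: one stack frame per level of nesting. */
        int depth() {
            int max = 0;
            for (Node p : parents) {
                max = Math.max(max, p.depth());
            }
            return max + 1;
        }
    }

    static Node leaf() {
        return new Node(new ArrayList<>());
    }

    /** Pairwise chaining: every union wraps the previous result. */
    static Node chained(int n) {
        Node unioned = leaf();
        for (int i = 1; i < n; i++) {
            List<Node> parents = new ArrayList<>();
            parents.add(unioned);
            parents.add(leaf());
            unioned = new Node(parents);
        }
        return unioned;
    }

    /** Flat union: a single node with all n inputs as direct parents. */
    static Node flat(int n) {
        List<Node> parents = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parents.add(leaf());
        }
        return new Node(parents);
    }

    public static void main(String[] args) {
        System.out.println("chained depth: " + chained(1000).depth()); // 1000
        System.out.println("flat depth: " + flat(1000).depth());       // 2
    }
}
```

In this sketch the chained structure needs one recursive call per input, so 
the stack grows with the number of unions, while the flat structure needs 
only two levels no matter how many inputs are passed in.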

> Unable to union large number of PCollections 
> ---------------------------------------------
>
>                 Key: CRUNCH-494
>                 URL: https://issues.apache.org/jira/browse/CRUNCH-494
>             Project: Crunch
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.8.3
>            Reporter: Surbhi Mungre
>            Assignee: Josh Wills
>            Priority: Minor
>         Attachments: CRUNCH-494.patch
>
>
> If you try to union a large number of PCollections (~5K), Crunch throws a 
> StackOverflowError. 
> {noformat}
> java.lang.StackOverflowError
>       at com.google.common.collect.AbstractIndexedListIterator.<init>(AbstractIndexedListIterator.java:68)
>       at com.google.common.collect.AbstractIndexedListIterator.<init>(AbstractIndexedListIterator.java:54)
>       at com.google.common.collect.Iterators$12.<init>(Iterators.java:1072)
>       at com.google.common.collect.Iterators.forArray(Iterators.java:1072)
>       at com.google.common.collect.RegularImmutableList.iterator(RegularImmutableList.java:68)
>       at com.google.common.collect.RegularImmutableList.iterator(RegularImmutableList.java:31)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:291)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
>       at org.apache.crunch.impl.dist.collect.PCollectionImpl.getTargetDependencies(PCollectionImpl.java:292)
> {noformat}
> Here is a simple test which can reproduce the issue. 
> https://gist.github.com/anonymous/22f08511604341d0ffda



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
