Github user ericdf commented on the pull request:
https://github.com/apache/spark/pull/2463#issuecomment-56247070
Fundamentally the way union works is flawed because it forces a caller to
create a recursive structure.
In my case, I have
files = [...]  # some non-empty list of inputs
rdd = sc.createAnRDDInTheUsualWay(files[0])
for afile in files[1:]:
    rdd = rdd.union(sc.createAnRDDInTheUsualWay(afile))
At each iteration of the loop, I'm creating a UnionRDD whose collection of
RDDs contains exactly one other UnionRDD (the result of the previous
iteration). You've coded for a tree, but really have a linked list that
will blow up the stack.
It should be possible for me to get a broad, flat structure instead,
ideally by doing something like this:
rddgen = (sc.createAnRDDInTheUsualWay(x) for x in files)
rdd = sc.union(rddgen)
The proposed patch does not do that, but it should.
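The structural difference can be sketched with a toy class in plain Python (not Spark; `UnionNode` and `Leaf` here are hypothetical stand-ins for the real UnionRDD, used only to show how nesting depth grows under the two approaches):

```python
# Toy sketch: each pairwise union() wraps two nodes, so chaining unions
# in a loop builds a degenerate tree that is really a linked list.
class UnionNode:
    def __init__(self, children):
        self.children = list(children)

    def depth(self):
        # Recursive depth; on a real deeply nested structure this kind
        # of recursion is exactly what overflows the stack.
        if not self.children:
            return 1
        return 1 + max(c.depth() for c in self.children)

class Leaf(UnionNode):
    def __init__(self):
        super().__init__([])

# Pairwise union, as in the loop above: depth grows linearly with inputs.
chained = Leaf()
for _ in range(100):
    chained = UnionNode([chained, Leaf()])

# Flat union, as proposed: one node holding all inputs at once.
flat = UnionNode([Leaf() for _ in range(101)])

print(chained.depth())  # 101: linear in the number of inputs
print(flat.depth())     # 2: constant, regardless of input count
```

With the flat structure, any recursion over the lineage touches a single level of children instead of one stack frame per input file.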