Github user ericdf commented on the pull request:
https://github.com/apache/spark/pull/2463#issuecomment-56247070
Fundamentally the way union works is flawed because it forces a caller to
create a recursive structure.
In my case, I have
files = [...]  # some non-empty list of inputs
rdd = sc.createAnRDDInTheUsualWay(files[0])
for afile in files[1:]:
    rdd = rdd.union(sc.createAnRDDInTheUsualWay(afile))
At each iteration of the loop, I'm creating a UnionRDD whose collection of
RDDs contains exactly one other UnionRDD (the result of the previous
iteration). You've coded for a tree, but really have a linked list that
will blow up the stack.
It should be possible for me to get a broad, flat structure instead,
ideally by doing something like this:
rddgen = (sc.createAnRDDInTheUsualWay(x) for x in files)
rdd = sc.union(rddgen)
The proposed patch does not do that, but it should.
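The structural difference can be sketched with a toy class in plain Python (not Spark; `UnionNode` and `Leaf` here are hypothetical stand-ins for the real UnionRDD, used only to show how nesting depth grows under the two approaches):

```python
# Toy sketch: each pairwise union() wraps two nodes, so chaining unions
# in a loop builds a degenerate tree that is really a linked list.
class UnionNode:
    def __init__(self, children):
        self.children = list(children)

    def depth(self):
        # Recursive depth; on a real deeply nested structure this kind
        # of recursion is exactly what overflows the stack.
        if not self.children:
            return 1
        return 1 + max(c.depth() for c in self.children)

class Leaf(UnionNode):
    def __init__(self):
        super().__init__([])

# Pairwise union, as in the loop above: depth grows linearly with inputs.
chained = Leaf()
for _ in range(100):
    chained = UnionNode([chained, Leaf()])

# Flat union, as proposed: one node holding all inputs at once.
flat = UnionNode([Leaf() for _ in range(101)])

print(chained.depth())  # 101: linear in the number of inputs
print(flat.depth())     # 2: constant, regardless of input count
```

With the flat structure, any recursion over the lineage touches a single level of children instead of one stack frame per input file.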