Github user JoshRosen commented on the pull request:

    https://github.com/apache/spark/pull/3794#issuecomment-68382391
  
    @markhamstra 
    
    > How would this interact with the idea of @erikerlandson to defer 
partition computation?
    #3079
    
    Maybe I'm overlooking something, but #3079 seems kind of orthogonal.  That 
issue is concerned with making the `sortByKey` transformation lazy so that it 
does not eagerly trigger a Spark job to compute the range partition boundaries, 
whereas this pull request is about eager vs. lazy evaluation of what's 
effectively a Hadoop filesystem metadata call.
    
    Maybe eager vs. lazy is the wrong way to think about this PR's issue, 
though, since we're really concerned with _where_ the call is performed 
(blocking the DAGScheduler's event loop vs. a driver user-code thread) rather 
than _when_ it's performed.  That said, you could contrive an example where 
this patch changes the behavior of a user job: someone defines some 
transformations up-front, runs jobs to generate output, then reads that output 
back in another RDD.  In that case, the data to be read might not exist at the 
time the RDD is defined but will exist by the time the first action on it is 
invoked.  So maybe we should consider moving the first `partitions` call 
closer to the DAGScheduler's job submission methods, but not inside of the 
actor (e.g. don't change any code in `RDD`, but just add a call that traverses 
the lineage chain and calls `partitions` on each RDD, making sure that this 
call occurs before the job submitter sends a message into the DAGScheduler 
actor).
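    
    To make the suggestion concrete, here's a rough sketch of what such a 
lineage traversal might look like (the helper name `eagerlyComputePartitions` 
is hypothetical, not existing Spark code; it only assumes the public 
`RDD.partitions` and `RDD.dependencies` members):
    
    ```scala
    import org.apache.spark.rdd.RDD
    import scala.collection.mutable
    
    // Hypothetical helper: walk the lineage graph and force partition
    // computation on the caller's (driver user-code) thread, before the
    // job submitter sends JobSubmitted into the DAGScheduler actor.
    def eagerlyComputePartitions(rdd: RDD[_]): Unit = {
      val visited = mutable.HashSet[RDD[_]]()
      val stack = mutable.Stack[RDD[_]](rdd)
      while (stack.nonEmpty) {
        val r = stack.pop()
        if (!visited(r)) {
          visited += r
          // May perform filesystem metadata calls (e.g. for HadoopRDD),
          // but here it blocks a user thread, not the scheduler's event loop.
          r.partitions
          r.dependencies.foreach(dep => stack.push(dep.rdd))
        }
      }
    }
    ```
    
    The traversal keeps a visited set because lineage graphs can share 
ancestors, so each RDD's `partitions` would be computed at most once.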

