GitHub user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/3794#issuecomment-68295291
To summarize the motivation a bit more succinctly: the problem here is that the
first call to `rdd.partitions` might be expensive and might occur inside the
DAGScheduler event loop, blocking the entire scheduler. I guess this is an
unfortunate side-effect of laziness: we might have expensive lazy
initialization, but it can be hard to reason about when and where it will
actually run, which leads to difficult-to-diagnose performance bottlenecks.
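Purely to illustrate that hazard (a made-up sketch, not Spark code; the class
name, sleep, and `Array[Int]` partitions are all invented for the example):
the expensive work runs on whichever thread happens to touch the lazy value
first, so a single-threaded event loop stalls for the whole computation.

```scala
// Hypothetical illustration only, not Spark code: an expensive, lazily
// cached value whose first evaluation happens on whichever thread
// touches it first.
class LazyPartitions {
  lazy val partitions: Array[Int] = {
    Thread.sleep(5000)                  // stand-in for slow filesystem metadata calls
    Array.tabulate(1000)(identity)
  }
}

object Demo extends App {
  val rdd = new LazyPartitions          // cheap: nothing has been computed yet

  // If the first access happens on a single-threaded event loop, that loop
  // is blocked for the full five seconds.
  val eventLoop = new Thread(() => println(rdd.partitions.length))
  eventLoop.start()
  eventLoop.join()
}
```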
It seems like the fix in this patch is to force `partitions` to be
eagerly computed in the driver thread that defines the RDD. This seems like a
good idea, but I have a few minor nits with the fix as it's currently
implemented:
- I understand that the motivation for this is HadoopRDD's expensive
`getPartitions` method, but it seems like the problem is potentially more
general. Is there any way to handle this in `RDD` instead? I understand that we
can't just make `partitions` into a `val`, but the `@transient partitions_`
logic is already there in `RDD`, so maybe we could just toss a
`self.partitions()` call into the `RDD` constructor to force eager evaluation
on the driver (see the sketch after this list).
- If there's some reason that we can't implement my proposal in `RDD`, then
I think we can just add a call to `self.partitions()` at the end of `HadoopRDD`;
this would eliminate the need for a bunch of the confusing variable names added
here.
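To make the first suggestion a bit more concrete, here's a rough sketch of what
I have in mind. `SketchRDD`, `SketchHadoopRDD`, `numSplits`, and the `Array[Int]`
partition type are made-up stand-ins to keep the example self-contained; this is
not the real `RDD`/`HadoopRDD` code.

```scala
// Hypothetical sketch, not Spark's actual source: a cached partitions field
// plus an eager call at construction time, so the expensive work runs on the
// driver thread that defines the RDD.
abstract class SketchRDD {
  // Subclasses implement the (possibly expensive) partition computation.
  protected def getPartitions: Array[Int]

  // Cached on first access (mirrors the existing @transient partitions_ idea).
  @transient private var partitions_ : Array[Int] = _

  final def partitions: Array[Int] = {
    if (partitions_ == null) {
      partitions_ = getPartitions
    }
    partitions_
  }
}

// Stand-in for HadoopRDD, where getPartitions is slow (e.g. many HDFS calls).
class SketchHadoopRDD(numSplits: Int) extends SketchRDD {
  override protected def getPartitions: Array[Int] =
    Array.tabulate(numSplits)(identity)

  // Eager call at the end of the constructor: the cost is paid right here on
  // the driver, not at the first (unpredictable) access somewhere else.
  partitions
}
```

One thing worth double-checking if the eager call goes into the base class's
constructor instead (so that every RDD is covered): superclass constructors run
before subclass field initializers on the JVM, so `getPartitions` would execute
before the subclass's own fields are set up.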