GitHub user JoshRosen commented on the pull request:
https://github.com/apache/spark/pull/3794#issuecomment-68295291
To summarize the motivation a bit more succinctly: the problem here is that the
first call to `rdd.partitions` might be expensive and might occur inside the
DAGScheduler event loop, blocking the entire scheduler. I guess this is an
unfortunate side-effect of laziness: we might have expensive lazy
initialization, but it can be hard to reason about when and where it will
actually run, which leads to difficult-to-diagnose performance bottlenecks.
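Purely to illustrate that hazard (a made-up sketch, not Spark code; the class
name, sleep, and `Array[Int]` partitions are all invented for the example):
the expensive work runs on whichever thread happens to touch the lazy value
first, so a single-threaded event loop stalls for the whole computation.

```scala
// Hypothetical illustration only, not Spark code: an expensive, lazily
// cached value whose first evaluation happens on whichever thread
// touches it first.
class LazyPartitions {
  lazy val partitions: Array[Int] = {
    Thread.sleep(5000)                  // stand-in for slow filesystem metadata calls
    Array.tabulate(1000)(identity)
  }
}

object Demo extends App {
  val rdd = new LazyPartitions          // cheap: nothing has been computed yet

  // If the first access happens on a single-threaded event loop, that loop
  // is blocked for the full five seconds.
  val eventLoop = new Thread(() => println(rdd.partitions.length))
  eventLoop.start()
  eventLoop.join()
}
```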
It seems like the fix in this patch is to force `partitions` to be
eagerly computed in the driver thread that defines the RDD. This seems like a
good idea, but I have a few minor nits with the fix as it's currently
implemented:
- I understand that the motivation for this is HadoopRDD's expensive
`getPartitions` method, but it seems like the problem is potentially more
general. Is there any way to handle this in `RDD` instead? I understand that we
can't just make `partitions` into a `val`, but the `@transient partitions_`
logic is already there in `RDD`, so maybe we could just toss a
`self.partitions()` call into the `RDD` constructor to force eager evaluation
on the driver (see the sketch after this list).
- If there's some reason that we can't implement my proposal in `RDD`, then
I think we can just add a call to `self.partitions()` at the end of `HadoopRDD`;
this would eliminate the need for a bunch of the confusing variable names added
here.
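To make the first suggestion a bit more concrete, here's a rough sketch of what
I have in mind. `SketchRDD`, `SketchHadoopRDD`, `numSplits`, and the `Array[Int]`
partition type are made-up stand-ins to keep the example self-contained; this is
not the real `RDD`/`HadoopRDD` code.

```scala
// Hypothetical sketch, not Spark's actual source: a cached partitions field
// plus an eager call at construction time, so the expensive work runs on the
// driver thread that defines the RDD.
abstract class SketchRDD {
  // Subclasses implement the (possibly expensive) partition computation.
  protected def getPartitions: Array[Int]

  // Cached on first access (mirrors the existing @transient partitions_ idea).
  @transient private var partitions_ : Array[Int] = _

  final def partitions: Array[Int] = {
    if (partitions_ == null) {
      partitions_ = getPartitions
    }
    partitions_
  }
}

// Stand-in for HadoopRDD, where getPartitions is slow (e.g. many HDFS calls).
class SketchHadoopRDD(numSplits: Int) extends SketchRDD {
  override protected def getPartitions: Array[Int] =
    Array.tabulate(numSplits)(identity)

  // Eager call at the end of the constructor: the cost is paid right here on
  // the driver, not at the first (unpredictable) access somewhere else.
  partitions
}
```

One thing worth double-checking if the eager call goes into the base class's
constructor instead (so that every RDD is covered): superclass constructors run
before subclass field initializers on the JVM, so `getPartitions` would execute
before the subclass's own fields are set up.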