Github user kayousterhout commented on the pull request:
https://github.com/apache/spark/pull/143#issuecomment-37685303
I'm not sure this fixes the problem Reynold was referring to in his pull
request. If you look in DAGScheduler.scala, on line 773, it does essentially
the same thing you do here (it serializes the closure to make sure it's
actually serializable); that code gets called as a result of
dagScheduler.submitJob, which happens right after clean() gets called on the
RDD. So I think the functionality you added already exists -- it just gets
invoked a bit later.
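For reference, the kind of check being discussed is just an up-front
serialization attempt. A minimal sketch (the name `ensureSerializable` is
mine for illustration, not the actual DAGScheduler code):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Try to serialize the closure eagerly, so a NotSerializableException
// surfaces immediately with a clear error instead of much later.
def ensureSerializable(closure: AnyRef): Unit = {
  try {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    out.writeObject(closure)
    out.close()
  } catch {
    case e: NotSerializableException =>
      throw new IllegalArgumentException("Task not serializable", e)
  }
}
```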
I think what @rxin was referring to is the fact that if you do a
transformation (e.g., call map on an RDD), it gets lazily evaluated (you can
see this if you look at the map() function at RDD.scala:247 -- it just creates
a new RDD object but doesn't evaluate the transformation). So the
serialization error won't occur until potentially much later, when the user
calls an action like collect() or count() that forces the transformation to
be computed. My understanding is that Reynold was suggesting adding the
serialization check in map() and the other transformations, as mentioned in
the JIRA, so that the serialization error gets triggered as soon as the user
calls map(), as in the example below.
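To make the difference concrete, here's a hypothetical example (assuming
`sc` is a SparkContext and `Helper` is a made-up user class). Today the
failure only appears at the action; with the eager check it would appear
at the map() call itself:

```scala
// Helper class that is NOT Serializable; passing one of its methods as a
// closure captures the instance and makes the task unserializable.
class Helper { def inc(x: Int): Int = x + 1 }

val helper = new Helper
val mapped = sc.parallelize(1 to 10).map(helper.inc) // today: no error (lazy)
mapped.collect() // today: "Task not serializable" surfaces here, far from map()

// With the suggested check, map() would attempt to serialize its closure
// (e.g., via something like ensureSerializable above) before returning the
// new RDD, so the stack trace would point at the map() call instead.
```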