Github user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21157

Woah. Okay. Let me cc some folks who may be interested in this again (@felixcheung looks to be here already): @ueshin, @BryanCutler, @holdenk and @JoshRosen. Additionally, @rxin too.

Reynold, here's my understanding of what's going on: this is about removing the namedtuple hack we added a long, long while ago. The hack isn't super crucial now, since cloudpickle can handle namedtuples on its own without it. If we remove it, then for normal RDD operations the namedtuple should be defined in local scope. If it is defined in global scope, it fails to pickle with the normal pickle (not cloudpickle, which the SQL code path uses).

1. So the real downside of removing this now is that we disallow global-scope namedtuples.
2. The actual advantage is that we can get rid of the weird behaviours caused by this hack. For instance, see the PR description (both links: https://superbobry.github.io/pyspark-silently-breaks-your-namedtuples.html and https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html).

@superbobry, wanna add some more words?
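To illustrate the global-scope point, here is a minimal sketch that uses only the standard-library `pickle`, not PySpark or cloudpickle. It shows one way the problem can surface: plain pickle stores a module-level namedtuple class by reference (module name plus class name), so loading fails in a process where that definition is missing. The module name `hypothetical_driver` and the `Point` type are assumptions made up for the example; they do not come from the PR.

```python
import pickle
import sys
import types
from collections import namedtuple

# Hypothetical stand-in for the driver's top-level module (not a Spark API).
MODULE_NAME = "hypothetical_driver"
driver = types.ModuleType(MODULE_NAME)
sys.modules[MODULE_NAME] = driver

# A namedtuple defined at "global" (module) scope of that module.
Point = namedtuple("Point", ["x", "y"], module=MODULE_NAME)
driver.Point = Point

# Plain pickle records only the module and class name, not the class body.
payload = pickle.dumps(Point(1, 2))
stored_by_reference = MODULE_NAME.encode() in payload and b"Point" in payload

# While the definition is importable, the round trip works.
restored = pickle.loads(payload)

# Simulate a process whose module of that name lacks the definition.
del driver.Point
try:
    pickle.loads(payload)
    failed_without_definition = False
except AttributeError:
    failed_without_definition = True

print(stored_by_reference, restored, failed_without_definition)
```

This only mimics the situation by deleting the attribute from a fake module; how the real worker processes resolve names is not modelled here.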