GitHub user HyukjinKwon commented on the issue:

    https://github.com/apache/spark/pull/21157
  
    Woah. Okay. Let me cc some folks interested in this again (@felixcheung 
looks to be here already): @ueshin, @BryanCutler, @holdenk and @JoshRosen.
    
    Also cc @rxin. Here's my understanding:
    
    Reynold, here's what's going on: this is about removing the namedtuple hack 
we added a long, long while ago. The hack is no longer crucial, since 
cloudpickle can handle namedtuples on its own without it. If we remove it, then 
for normal RDD operations the namedtuple has to be defined in local scope. If 
it is defined in global scope, it fails to pickle with the normal pickle (as 
opposed to cloudpickle, which the SQL code path uses).
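
    To illustrate the by-reference behaviour of the normal pickle, here is a 
minimal sketch in plain Python with no Spark involved. `Point` is a made-up 
name for illustration, not something from the PR. It only shows that the 
payload carries the module and class name rather than the class definition:

```python
import pickle
from collections import namedtuple

# Hypothetical example type, defined at global (module) scope.
Point = namedtuple("Point", ["x", "y"])

# The standard pickle stores a reference to the class (module name plus
# class name) together with the field values, not the class body itself.
payload = pickle.dumps(Point(1, 2))

module_name = Point.__module__  # "__main__" when this runs as a script
has_module_ref = module_name.encode() in payload
has_class_ref = b"Point" in payload

# Loading works here only because this same process can resolve the
# reference back to the Point class defined above.
restored = pickle.loads(payload)
print(has_module_ref, has_class_ref, restored)
```

    As I understand it, a process that cannot resolve that module and name to 
the same class cannot load the payload, which is the gap the hack (and 
cloudpickle, by serializing the class itself) covers.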
    
    1. So the real downside of removing this now is that we disallow 
global-scope namedtuples.
    
    2. The actual advantage is that we can get rid of the weird behaviours 
caused by this hack. For instance, see the PR description (both links: 
https://superbobry.github.io/pyspark-silently-breaks-your-namedtuples.html and 
https://superbobry.github.io/tensorflowonspark-or-the-namedtuple-patch-strikes-again.html).
    
    @superbobry, wanna add some more words?

