Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18664
> The baseline should be (as said above): Internal optimisation should not
> introduce any behaviour change, and we are discouraged to change the previous
> behaviour unless it has bugs in general.
I am not sure if I totally agree with this.
Take structs, for instance: in the non-Arrow version, a struct column is turned
into `pyspark.sql.Row` objects in `toPandas()`. I wouldn't call this a bug
because it's a design choice. However, it is maybe not the best design choice,
because if the user pickles the `pandas.DataFrame` to a file and sends it to
someone, the receiver won't be able to deserialize that `pandas.DataFrame`
without having the pyspark library as a dependency.
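Just to make that concern concrete, here is a minimal sketch of the behavior (the `named_struct` query, the local session, and the file name are illustrative assumptions, and the exact module path of the `Row` class may vary across versions):

```python
import pickle
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# A DataFrame with a single struct column
df = spark.sql("SELECT named_struct('a', 1, 'b', 'x') AS s")

pdf = df.toPandas()          # non-Arrow path
print(type(pdf["s"][0]))     # a pyspark Row object, not a plain dict/tuple

# pickle stores instances by class reference, so unpickling this
# DataFrame requires pyspark to be importable on the receiving side:
with open("shared.pkl", "wb") as f:
    pickle.dump(pdf, f)
# pickle.load(...) on a machine without pyspark would fail with
# something like: ModuleNotFoundError: No module named 'pyspark'
```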
Now I am **not** trying to argue that we should or should not turn struct
columns into `pyspark.sql.Row`; my point is that there might be some design
choices in the non-Arrow version that are not ideal, so making the behavior
100% identical between the non-Arrow and Arrow versions is maybe not a hard
requirement.