Github user icexelloss commented on the issue:
https://github.com/apache/spark/pull/18664
> The baseline should be (as said above): Internal optimisation should not
> introduce any behaviour change, and we are discouraged to change the previous
> behaviour unless it has bugs in general.
I am not sure if I totally agree with this.
Take structs, for instance: in the non-Arrow version, a struct column is turned
into `pyspark.sql.Row` objects in `toPandas()`. I wouldn't call this a bug
because it's a design choice. However, it is maybe not the best design choice,
because if the user pickles the `pandas.DataFrame` to a file and sends it to
someone, the receiver won't be able to deserialize that `pandas.DataFrame`
without having the pyspark library as a dependency.
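Just to make that concern concrete, here is a minimal sketch of the behavior (the `named_struct` query, the local session, and the file name are illustrative assumptions, and the exact module path of the `Row` class may vary across versions):

```python
import pickle
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").getOrCreate()

# A DataFrame with a single struct column
df = spark.sql("SELECT named_struct('a', 1, 'b', 'x') AS s")

pdf = df.toPandas()          # non-Arrow path
print(type(pdf["s"][0]))     # a pyspark Row object, not a plain dict/tuple

# pickle stores instances by class reference, so unpickling this
# DataFrame requires pyspark to be importable on the receiving side:
with open("shared.pkl", "wb") as f:
    pickle.dump(pdf, f)
# pickle.load(...) on a machine without pyspark would fail with
# something like: ModuleNotFoundError: No module named 'pyspark'
```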
Now I am **not** trying to argue that we should or should not turn struct
columns into `pyspark.sql.Row`; my point is that there might be some design
choices in the non-Arrow version that are not ideal, so making the behavior
100% identical between the non-Arrow and Arrow versions is maybe not a hard
requirement.