I have used a very similar script, but I think a few extra steps are needed before it is as robust as toPandas. If you look at _to_corrected_pandas_type, which toPandas uses (https://github.com/apache/spark/blob/master/python/pyspark/sql/dataframe.py#L1869), something equivalent would have to be implemented here as well. I agree that serializing the data to a pandas DataFrame or NumPy array is faster and less memory intensive.
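For context, here is a minimal sketch of the kind of correction that step performs (the type names and helper functions below are illustrative assumptions, not Spark's actual implementation): pandas tends to promote integer columns to float64 when building a frame from collected rows, so the Spark schema is used to cast columns back to the narrower NumPy dtype, skipping columns that contain nulls since NaN cannot be cast to an integer dtype.

```python
import numpy as np
import pandas as pd

# Illustrative mapping from Spark type names to NumPy dtypes.
# These string keys are assumptions for this sketch, not Spark's API.
_SPARK_TO_NUMPY = {
    "byte": np.int8,
    "short": np.int16,
    "integer": np.int32,
    "long": np.int64,
    "float": np.float32,
}


def corrected_dtype(spark_type_name):
    """Return the NumPy dtype a column should be cast to, or None."""
    return _SPARK_TO_NUMPY.get(spark_type_name)


def correct_pandas_types(pdf, schema):
    """Cast columns of `pdf` per a {column: spark_type_name} schema.

    Columns containing nulls are left as-is, because casting NaN to an
    integer dtype would fail.
    """
    for col, type_name in schema.items():
        dtype = corrected_dtype(type_name)
        if dtype is not None and not pdf[col].isnull().any():
            pdf[col] = pdf[col].astype(dtype, copy=False)
    return pdf
```

A direct pyarrow/NumPy serialization path would need an equivalent pass so that, e.g., an IntegerType column comes back as int32 rather than float64.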