Hi,

A common pattern in my work is querying large tables in Spark DataFrames
and then needing to do more detailed analysis locally when the data can fit
into memory. However, I've hit a few blockers. In Scala no well-developed
DataFrame library exists, and in Python the `toPandas` function is very
slow. As Pandas is one of the best DataFrame libraries out there, it may be
worth spending some time making the `toPandas` method more efficient.

Having a quick look at the code, it looks like a lot of iteration is
occurring on the Python side. Python is really slow at iterating over large
loops, so this should be avoided, and if iteration does have to occur it's
best done in Cython. Has anyone looked at Cythonising the process? Or, even
better, serialising directly to NumPy arrays instead of going through the
intermediate lists of Rows?
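
To make the idea concrete, here is a rough sketch (my own illustration, not
existing Spark API or a proposed patch) of the kind of column-oriented
conversion I have in mind, as opposed to handing Pandas a list of Row
objects:

import numpy as np
import pandas as pd

def to_pandas_columnar(df):
    """Sketch only: build one NumPy array per column rather than passing
    a list of Row objects to Pandas. The real win would require the
    deserialisation itself to produce columnar buffers, but this shows
    the target layout."""
    rows = df.collect()              # today: a Python list of Row objects
    columns = df.columns
    data = {}
    for i, name in enumerate(columns):
        # One pass per column; with columnar serialisation this Python
        # loop would ideally become a single buffer copy.
        data[name] = np.array([row[i] for row in rows])
    return pd.DataFrame(data, columns=columns)

Of course this still iterates in Python, so the serialiser would need to
write columnar data on the JVM side for it to really pay off; the sketch is
just to illustrate the shape of the result.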

Here are some links to the current code:

toPandas:
https://github.com/apache/spark/blob/8e0b030606927741f91317660cd14a8a5ed6e5f9/python/pyspark/sql/dataframe.py#L1342

collect:
https://github.com/apache/spark/blob/8e0b030606927741f91317660cd14a8a5ed6e5f9/python/pyspark/sql/dataframe.py#L233

_load_from_socket:
https://github.com/apache/spark/blob/a60f91284ceee64de13f04559ec19c13a820a133/python/pyspark/rdd.py#L123

Josh Levy-Kramer
Data Scientist @ Starcount
