srowen commented on a change in pull request #24234: [WIP][SPARK_26022][PYTHON][DOCS] PySpark Comparison with Pandas
URL: https://github.com/apache/spark/pull/24234#discussion_r270020781
##########
File path: docs/sql-pyspark-comparison-with-pandas.md
##########
@@ -0,0 +1,401 @@
+---
+layout: global
+title: PySpark Comparison with Pandas
+displayTitle: PySpark Comparison with Pandas
+---
+
+Both PySpark and Pandas cover important use cases and provide a rich set of features to interact
+with various structural and semistructral data in Python world. Often, PySpark users are used to
+Pandas. Therefore, this document targets to document the comparison.
+
+* Overview
+* DataFrame APIs
+ * Quick References
+ * Create DataFrame
+ * Load DataFrame
+ * Save DataFrame
+ * Inspect DataFrame
+ * Interaction between PySpark and Pandas
+* Notable Differences
+ * Lazy and Eager Evaluation
+ * Direct assignment
+ * NULL, None, NaN and NaT
+ * Type inference, coercion and cast
+
+
+## Overview
+
+PySpark and Pandas support common functionality to load, save, create, transform and describe
+DataFrame. PySpark provides conversion from/to Pandas DataFrame, and PySpark introduced Pandas
+UDFs which allow to use Pandas APIs as are for interoperability between them.
+
+Nevertheless, there are fundamental differences between them to note in general.
+
+1. PySpark DataFrame is a distributed dataset across multiple nodes whereas Pandas DataFrame is a
+ local dataset within single node.
+
+ It brings a practical point. If you handle larget dataset, arguably PySpark brings arguably a
+ better performance in general. If the dataset to process does not fix into the memory in a

Review comment:
fix -> fit

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
