srowen commented on a change in pull request #24234:
[WIP][SPARK_26022][PYTHON][DOCS] PySpark Comparison with Pandas
URL: https://github.com/apache/spark/pull/24234#discussion_r270019798
##########
File path: docs/sql-pyspark-comparison-with-pandas.md
##########
@@ -0,0 +1,401 @@
+---
+layout: global
+title: PySpark Comparison with Pandas
+displayTitle: PySpark Comparison with Pandas
+---
+
+Both PySpark and Pandas cover important use cases and provide a rich set of features to interact
+with various structured and semi-structured data in the Python world. PySpark users are often
+already familiar with Pandas, so this document compares the two.
+
+* Overview
+* DataFrame APIs
+ * Quick References
+ * Create DataFrame
+ * Load DataFrame
+ * Save DataFrame
+ * Inspect DataFrame
+ * Interaction between PySpark and Pandas
+* Notable Differences
+ * Lazy and Eager Evaluation
+ * Direct assignment
+ * NULL, None, NaN and NaT
+ * Type inference, coercion and cast
+
+
+## Overview
+
+PySpark and Pandas support common functionality to load, save, create, transform and describe
+DataFrames. PySpark provides conversion from and to Pandas DataFrames, and PySpark introduced Pandas
+UDFs, which allow Pandas APIs to be used as they are for interoperability between the two.
+
+Nevertheless, there are fundamental differences between them worth noting.
+
+1. A PySpark DataFrame is a dataset distributed across multiple nodes, whereas a Pandas DataFrame is a
+   local dataset within a single node.
+
+   This has a practical consequence. If you handle a large dataset, PySpark arguably gives
+   better performance in general. If the dataset to process does not fit into the memory of a
+   single node, using PySpark is probably the way to go. For a small dataset, Pandas might be
+   faster in general since there is no overhead from, for instance, the network.
+
+2. A PySpark DataFrame is lazily evaluated, whereas a Pandas DataFrame is eagerly evaluated.
+
+   PySpark only builds up a plan for transformations and runs it when an action (for example
+   `count()` or `collect()`) is called, whereas Pandas executes each operation immediately
+   against the data set.
+
+3. A PySpark DataFrame is immutable by nature, whereas a Pandas DataFrame is mutable.
+
+   In PySpark, a DataFrame cannot be changed once it is created. Instead, it is transformed
+   into another DataFrame, whereas a Pandas DataFrame is mutable and its state can be updated
+   directly. A typical analogy is `String` vs `StringBuilder` in Java.
+
+4. PySpark operations on DataFrames tend to comply with SQL semantics.
+
+   This causes some subtleties compared to Pandas, for instance, around `NaN`, `None` and `NULL`.
+
+There are similarities and differences between them which might cause confusion. This document
+describes them and illustrates them with several examples.
+
+
+
+## DataFrame APIs
+
+This chapter describes DataFrame APIs in both PySpark and Pandas.
+
+
+### Quick References
+
+| PySpark | Pandas |
+| ------- | ------ |
+| `df.limit(3)` | `df.head(3)` |
+| `df.filter("a == 1 AND b == 2")` | `df.query("a == 1 and b == 2")` |
+| `df.filter((df.a == 1) & (df.b == 2))` | `df[(df.a == 1) & (df.b == 2)]` |
+| `df.select("a", "b")` | `df[["a", "b"]]` |
+| `df.drop_duplicates()` | `df.drop_duplicates()` |
+| `df.sample(fraction=0.01)` | `df.sample(frac=0.01)` |
+| `df.groupby("a").count()` | `df.groupby("a").size()` |
+| `df.groupby("a").agg({"b": "sum"})` | `df.groupby("a").agg({"b": np.sum})` |
+| `df1.join(df2, on="a")` | `pandas.merge(df1, df2, on="a")` |
+| `df1.union(df2)` | `pandas.concat([df1, df2])` |
+| `df = df.withColumn("a", when(df["a"] < 5, df["a"] * 2).otherwise(df["a"]))` | `df.loc[df['a'] < 5, 'a'] *= 2` |
+
+
+### Create DataFrame
+
+To create a DataFrame in PySpark and Pandas, you can run the code below:
+
+```python
+# PySpark
+data = zip(['Chicago', 'San Francisco', 'New York City'], range(1, 4))
+spark.createDataFrame(list(data), ["city", "rank"])
+```
+
+```python
+# Pandas
+data = {'city': ['Chicago', 'San Francisco', 'New York City'], 'rank': range(1, 4)}
+pandas.DataFrame(data)
+```
+
+One notable difference when creating a DataFrame is that Pandas accepts the data as below:
+
+```
+data = {
+ 'city': ['Chicago', 'San Francisco', 'New York City'],
+ 'rank': range(1, 4)
+}
+```
+
+and interprets it as:
+
+```
+            city  rank
+0        Chicago     1
+1  San Francisco     2
+2  New York City     3
+```
+
+So, a dictionary that contains key and multiple values becomes DataFrame but PySpark does
Review comment:
key and ... -> multiple values for a key
becomes DataFrame -> becomes a DataFrame,
Pyspark does -> Pyspark does not?
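
For context, a minimal sketch of the distinction this sentence seems to be drawing, assuming the intended ending really is "PySpark does not". The PySpark example earlier in the diff passes a list of rows plus column names, so column-oriented data like the dictionary above would have to be transposed first. `columns_to_rows` is a hypothetical helper for illustration, not part of PySpark or Pandas, and the `spark` session in the final comment is assumed to exist.

```python
# Column-oriented data, the shape the diff says Pandas accepts directly.
data = {
    'city': ['Chicago', 'San Francisco', 'New York City'],
    'rank': range(1, 4),
}


def columns_to_rows(columns):
    """Hypothetical helper: turn {name: values} into (rows, names)."""
    names = list(columns)
    # zip(*...) transposes the per-column sequences into per-row tuples.
    rows = list(zip(*(columns[name] for name in names)))
    return rows, names


rows, names = columns_to_rows(data)

# rows and names now have the same shape as the arguments in the PySpark
# example from the diff (assumes an active SparkSession named `spark`):
# spark.createDataFrame(rows, names)
```

If the doc keeps this paragraph, spelling out the row-oriented versus column-oriented input shapes may be clearer than the current wording.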