Yikun commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652403201
########## File path: python/docs/source/development/contributing.rst ##########

@@ -72,17 +72,86 @@ Preparing to Contribute Code Changes
 ------------------------------------
 Before starting to work on codes in PySpark, it is recommended to read `the general guidelines <https://spark.apache.org/contributing.html>`_.
-There are a couple of additional notes to keep in mind when contributing to codes in PySpark:
+Additionally, there are a couple of notes to keep in mind when contributing to code in PySpark:
+
+* **Be Pythonic.**
+* **APIs are matched with the Scala and Java sides in general.**
+* **PySpark-specific APIs can still be considered as long as they are Pythonic and do not conflict with other existing APIs, for example, decorator usage of UDFs.**
+* **If you extend or modify a public API, please adjust the corresponding type hints. See `Contributing and Maintaining Type Hints`_ for details.**
+
+If you are fixing the pandas API on Spark (``pyspark.pandas``) package, please consider the design principles below:
+
+* **Return pandas-on-Spark data structure for big data, and pandas data structure for small data**
+  Often developers face the question whether a particular function should return a pandas-on-Spark DataFrame/Series, or a pandas DataFrame/Series. The principle is: if the returned object can be large, use a pandas-on-Spark DataFrame/Series. If the data is bound to be small, use a pandas DataFrame/Series. For example, ``DataFrame.dtypes`` returns a pandas Series, because the number of columns in a DataFrame is bounded and small, whereas ``DataFrame.head()`` or ``Series.unique()`` returns a pandas-on-Spark DataFrame/Series, because the resulting object can be large.
+
+* **Provide discoverable APIs for common data science tasks**
+  At the risk of overgeneralization, there are two API design approaches: the first focuses on providing APIs for common tasks; the second starts with abstractions, and enables users to accomplish their tasks by composing primitives. While the world is not black and white, pandas takes more of the former approach, while Spark has taken more of the latter.
+
+  One example is value count (count by some key column), one of the most common operations in data science. pandas ``DataFrame.value_counts`` returns the result in sorted order, which in 90% of the cases is what users prefer when exploring data, whereas Spark's does not sort, which is more desirable when building data pipelines, as users can accomplish the pandas behavior by adding an explicit ``orderBy``.
+
+  Similar to pandas, pandas API on Spark should also lean more towards the former, providing discoverable APIs for common data science tasks. In most cases, this principle is well taken care of by simply implementing pandas' APIs. However, there will be circumstances in which pandas' APIs don't address a specific need, e.g. plotting for big data.
+
+* **Guardrails to prevent users from shooting themselves in the foot**
+  Certain operations in pandas are prohibitively expensive as data scales, and we don't want to give users the illusion that they can rely on such operations in pandas API on Spark. That is to say, methods implemented in pandas API on Spark should be safe to perform by default on large datasets. As a result, the following capabilities are not implemented in pandas API on Spark:
+
+  1. Capabilities that are fundamentally not parallelizable: e.g. imperatively looping over each element
+  2. Capabilities that require materializing the entire working set in a single node's memory.
+     This is why we do not implement `pandas.DataFrame.to_xarray <https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_xarray.html>`_. Another example is that the ``_repr_html_`` call caps the total number of records shown to a maximum of 1000, to prevent users from blowing up their driver node simply by typing the name of the DataFrame in a notebook.
+
+  A few exceptions, however, exist. One common pattern with "big data science" is that while the initial dataset is large, the working set becomes smaller as the analysis goes deeper. For example, data scientists often perform aggregation on datasets and want to then convert the aggregated dataset to some local data structure. To help data scientists, we offer the following:
+
+  * :func:`DataFrame.to_pandas` that returns a pandas DataFrame, koalas only

Review comment:
   koalas only --> pandas API on Spark only?
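
As an editorial illustration only (not part of the quoted diff), a minimal sketch of the return-type principle described above, assuming a PySpark environment where ``pyspark.pandas`` is importable and a Spark session can be created:

    import pyspark.pandas as ps

    psdf = ps.DataFrame({"group": ["a", "b", "a", "c"], "value": [1, 2, 3, 4]})

    # Bounded, small result: a plain pandas Series is returned.
    print(type(psdf.dtypes))

    # Potentially large results: they stay distributed as pandas-on-Spark objects.
    print(type(psdf.head(2)))
    print(type(psdf["group"].unique()))

    # Like pandas, value_counts() is discoverable and sorts its result by default.
    print(psdf["group"].value_counts())

The exact class names printed depend on the PySpark version, but the split between pandas and pandas-on-Spark results follows the principle quoted in the diff.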

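Similarly, as an illustration only, a sketch of the "aggregate first, then bring the small result local" exception that motivates ``DataFrame.to_pandas``; the file path and column names here are hypothetical:

    import pyspark.pandas as ps

    # Hypothetical input: the full dataset is assumed to be large.
    psdf = ps.read_parquet("/path/to/large/dataset")

    # After aggregation the working set is small (one row per group), so
    # materializing it on the driver with to_pandas() is safe.
    summary = psdf.groupby("group")["value"].sum().to_pandas()
    print(type(summary))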