[GitHub] [spark] HyukjinKwon commented on a change in pull request #32926: [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section

GitBox Wed, 16 Jun 2021 00:04:42 -0700


HyukjinKwon commented on a change in pull request #32926:
URL: https://github.com/apache/spark/pull/32926#discussion_r652406807




##########
File path: python/docs/source/development/ps_design.rst
##########
@@ -1,85 +0,0 @@
-=================
-Design Principles
-=================
-
-.. currentmodule:: pyspark.pandas
-
-This section outlines design principles guiding the pandas API on Spark.
-
-Be Pythonic
------------
-
-Pandas API on Spark targets Python data scientists. We want to stick to the 
convention that users are already familiar with as much as possible. Here are 
some examples:
-
-- Function names and parameters use snake_case, rather than camelCase. This is 
different from PySpark's design. For example, pandas API on Spark has 
`to_pandas()`, whereas PySpark has `toPandas()` for converting a DataFrame into 
a pandas DataFrame. In limited cases, to maintain compatibility with Spark, we 
also provide Spark's variant as an alias.
-
-- Pandas API on Spark respects to the largest extent the conventions of the 
Python numerical ecosystem, and allows the use of NumPy types, etc. that can be 
supported by Spark.
-
-- pandas-on-Spark docs' style and infrastructure simply follow the rest of the 
PyData projects'.
-
-Unify small data (pandas) API and big data (Spark) API, but pandas first
-------------------------------------------------------------------------
-
-The pandas-on-Spark DataFrame is meant to provide the best of pandas and Spark 
under a single API, with easy and clear conversions between each API when 
necessary. When Spark and pandas have similar APIs with subtle differences, the 
principle is to honor the contract of the pandas API first.
-
-There are different classes of functions:
-
- 1. Functions that are found in both Spark and pandas under the same name 
(`count`, `dtypes`, `head`). The return type is the same as in pandas (and not 
Spark's).
-    
- 2. Functions that are found in Spark but that have a clear equivalent in 
pandas, e.g. `alias` and `rename`. These functions will be implemented as 
aliases of the pandas functions, and should be marked as such. They are 
provided so that existing users of PySpark can get the benefits of pandas API 
on Spark without having to adapt their code.
- 
- 3. Functions that are only found in pandas. When these functions are 
appropriate for distributed datasets, they should become available in pandas 
API on Spark.
- 
- 4. Functions that are only found in Spark that are essential to controlling 
the distributed nature of the computations, e.g. `cache`. These functions 
should be available in pandas API on Spark.
-
-We are still debating whether data transformation functions only available in 
Spark should be added to pandas API on Spark, e.g. `select`. We would love to 
hear your feedback on that.
-
-Return pandas-on-Spark data structure for big data, and pandas data structure 
for small data
---------------------------------------------------------------------------------------------
-
-Often developers face the question of whether a particular function should 
return a pandas-on-Spark DataFrame/Series, or a pandas DataFrame/Series. The 
principle is: if the returned object can be large, use a pandas-on-Spark 
DataFrame/Series. If the data is bound to be small, use a pandas 
DataFrame/Series. For example, `DataFrame.dtypes` returns a pandas Series, 
because the number of columns in a DataFrame is bounded and small, whereas 
`DataFrame.head()` or `Series.unique()` returns a pandas-on-Spark 
DataFrame/Series, because the resulting object can be large.
-
-Provide discoverable APIs for common data science tasks
--------------------------------------------------------
-
-At the risk of overgeneralization, there are two API design approaches: the 
first focuses on providing APIs for common tasks; the second starts with 
abstractions, and enables users to accomplish their tasks by composing 
primitives. While the world is not black and white, pandas takes more of the 
former approach, while Spark has taken more of the latter.
-
-One example is value count (count by some key column), one of the most common 
operations in data science. pandas `DataFrame.value_counts` returns the result 
in sorted order, which in 90% of the cases is what users prefer when exploring 
data, whereas Spark's equivalent does not sort, which is more desirable when 
building data pipelines, as users can accomplish the pandas behavior by adding 
an explicit `orderBy`.
-
-Similar to pandas, pandas API on Spark should also lean more towards the 
former, providing discoverable APIs for common data science tasks. In most 
cases, this principle is well taken care of by simply implementing pandas' 
APIs. However, there will be circumstances in which pandas' APIs don't address 
a specific need, e.g. plotting for big data.
-
-Provide well documented APIs, with examples

Review comment:
       Removed as it's duplicate of https://spark.apache.org/contributing.html
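
As an aside on the "discoverable APIs" section quoted above, the difference between a task-oriented call and composed primitives can be sketched in plain Python. This is only an illustration under stated assumptions: `count_by_key` and `value_counts` are hypothetical helpers written for this note, not the pandas or Spark implementations.

```python
def count_by_key(values):
    """Primitive-style count (hypothetical helper): one entry per distinct
    value, in first-seen order, with no sorting by frequency."""
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    return counts


def value_counts(values):
    """Task-style count (hypothetical helper): builds on the primitive and
    adds the explicit ordering step, most frequent value first."""
    return sorted(count_by_key(values).items(),
                  key=lambda item: item[1],
                  reverse=True)


colors = ["green", "blue", "red", "blue", "red", "red"]

# Composable primitive: the caller decides whether and how to order.
unsorted_counts = count_by_key(colors)

# Discoverable task API: the common exploratory ordering is built in.
sorted_counts = value_counts(colors)
```

The second helper is just the first plus an ordering step, which mirrors the point in the quoted text that the sorted behavior can be recovered from the unsorted one by adding an explicit sort.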

##########
File path: python/docs/source/development/ps_design.rst
##########
@@ -1,85 +0,0 @@
-Provide well documented APIs, with examples
--------------------------------------------
-
-All functions and parameters should be documented. Most functions should be 
documented with examples, because examples are easier to understand than a 
blob of text explaining what the function does.
-
-A recommended way to add documentation is to start with the docstring of the 
corresponding function in PySpark or pandas, and adapt it for pandas API on 
Spark. If you are adding a new function, also add it to the API reference doc 
index page in the `docs/source/reference` directory. The examples in 
docstrings also improve our test coverage.
-
-Guardrails to prevent users from shooting themselves in the foot

Review comment:
       Moved and merged.
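
The remark in the quoted text that docstring examples also improve test coverage can be illustrated with the standard library's `doctest` module. The sketch below is a minimal, self-contained example; `to_snake_case` is a hypothetical helper invented for this note and is not part of pandas API on Spark, and the project's own test setup may run docstring examples differently.

```python
import doctest


def to_snake_case(name):
    """Convert a camelCase name to snake_case (hypothetical helper).

    Examples
    --------
    >>> to_snake_case("toPandas")
    'to_pandas'
    >>> to_snake_case("orderBy")
    'order_by'
    """
    parts = []
    for char in name:
        if char.isupper():
            # Start a new word at each uppercase letter.
            parts.append("_")
            parts.append(char.lower())
        else:
            parts.append(char)
    return "".join(parts)


# Parse the docstring into a doctest and execute its examples as tests.
parser = doctest.DocTestParser()
test = parser.get_doctest(
    to_snake_case.__doc__,
    {"to_snake_case": to_snake_case},
    "to_snake_case",
    None,
    0,
)
results = doctest.DocTestRunner(verbose=False).run(test)
```

Here `results` reports how many examples were attempted and how many failed, so the same text serves as both documentation and a regression check.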




