Yikun commented on code in PR #36748:
URL: https://github.com/apache/spark/pull/36748#discussion_r887620077


##########
python/pyspark/pandas/supported_api_gen.py:
##########
@@ -0,0 +1,375 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Generate 'Supported pandas APIs' documentation file
+"""
+import warnings
+from distutils.version import LooseVersion
+from enum import Enum, unique
+from inspect import getmembers, isclass, isfunction, signature
+from typing import Any, Callable, Dict, List, NamedTuple, Set, TextIO, Tuple
+
+import pyspark.pandas as ps
+import pyspark.pandas.groupby as psg
+import pyspark.pandas.window as psw
+
+import pandas as pd
+import pandas.core.groupby as pdg
+import pandas.core.window as pdw
+
+MAX_MISSING_PARAMS_SIZE = 5
+COMMON_PARAMETER_SET = {
+    "kwargs",
+    "args",
+    "cls",
+}  # These are not counted as missing parameters.
+MODULE_GROUP_MATCH = [(pd, ps), (pdw, psw), (pdg, psg)]
+
+RST_HEADER = """
+=====================
+Supported pandas API
+=====================
+
+.. currentmodule:: pyspark.pandas
+
+The following table shows the pandas APIs that implemented or non-implemented 
from pandas API on
+Spark. Some pandas API do not implement full parameters, so the third column 
shows missing
+parameters for each API.
+
+* 'Y' in the second column means it's implemented including its whole 
parameter.
+* 'N' means it's not implemented yet.
+* 'P' means it's partially implemented with the missing of some parameters.
+
+All API in the list below computes the data with distributed execution except 
the ones that require
+the local execution by design. For example, `DataFrame.to_numpy() 
<https://spark.apache.org/docs/
+latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.to_numpy.html>`__
+requires to collect the data to the driver side.
+
+If there is non-implemented pandas API or parameter you want, you can create 
an `Apache Spark
+JIRA <https://issues.apache.org/jira/projects/SPARK/summary>`__ to request or 
to contribute by
+your own.
+
+The API list is updated based on the `latest pandas official API reference
+<https://pandas.pydata.org/docs/reference/index.html#>`__.
+
+"""
+
+
+@unique
+class Implemented(Enum):
+    IMPLEMENTED = "Y"
+    NOT_IMPLEMENTED = "N"
+    PARTIALLY_IMPLEMENTED = "P"
+
+
+class SupportedStatus(NamedTuple):
+    """
+    Defines a supported status for specific pandas API
+    """
+
+    implemented: str
+    missing: str
+
+
+def generate_supported_api(output_rst_file_path: str) -> None:
+    """
+    Generate supported APIs status dictionary.
+
+    Parameters
+    ----------
+    output_rst_file_path : str
+        The path to the document file in RST format.
+
+    Write supported APIs documentation.
+    """
+    if LooseVersion(pd.__version__) < LooseVersion("1.4.0"):

Review Comment:
   If the target version is Spark `3.3.0`, we might want to pin the pandas version to `1.3.5`.



##########
python/pyspark/pandas/supported_api_gen.py:
##########
@@ -0,0 +1,375 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+"""
+Generate 'Supported pandas APIs' documentation file
+"""
+import warnings
+from distutils.version import LooseVersion
+from enum import Enum, unique
+from inspect import getmembers, isclass, isfunction, signature
+from typing import Any, Callable, Dict, List, NamedTuple, Set, TextIO, Tuple
+
+import pyspark.pandas as ps
+import pyspark.pandas.groupby as psg
+import pyspark.pandas.window as psw
+
+import pandas as pd
+import pandas.core.groupby as pdg
+import pandas.core.window as pdw
+
+MAX_MISSING_PARAMS_SIZE = 5
+COMMON_PARAMETER_SET = {
+    "kwargs",
+    "args",
+    "cls",
+}  # These are not counted as missing parameters.
+MODULE_GROUP_MATCH = [(pd, ps), (pdw, psw), (pdg, psg)]
+
+RST_HEADER = """
+=====================
+Supported pandas API
+=====================
+
+.. currentmodule:: pyspark.pandas
+
+The following table shows the pandas APIs that implemented or non-implemented 
from pandas API on
+Spark. Some pandas API do not implement full parameters, so the third column 
shows missing
+parameters for each API.
+
+* 'Y' in the second column means it's implemented including its whole 
parameter.
+* 'N' means it's not implemented yet.
+* 'P' means it's partially implemented with the missing of some parameters.
+
+All API in the list below computes the data with distributed execution except 
the ones that require
+the local execution by design. For example, `DataFrame.to_numpy() 
<https://spark.apache.org/docs/
+latest/api/python/reference/pyspark.pandas/api/pyspark.pandas.DataFrame.to_numpy.html>`__
+requires to collect the data to the driver side.
+

Review Comment:
   
https://github.com/apache/spark/blame/branch-3.3/python/docs/source/user_guide/pandas_on_spark/supported_pandas_api.rst#L39
   
   FYI, there are some differences between branch-3.3 and the original master.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to