[GitHub] [spark] itholic opened a new pull request #32036: [SPARK-34890][PYTHON] Port/integrate Koalas main codes into PySpark

GitBox Thu, 01 Apr 2021 23:29:33 -0700


itholic opened a new pull request #32036:
URL: https://github.com/apache/spark/pull/32036



   ### What changes were proposed in this pull request?
   
   As a first step of 
[SPARK-34849](https://issues.apache.org/jira/browse/SPARK-34849), this PR 
proposes porting the Koalas main code into PySpark.
   
   This PR contains minimal changes to the existing Koalas code as follows:
   1. `databricks.koalas` -> `pyspark.pandas`
   2. `from databricks import koalas as ks` -> `from pyspark import pandas as 
pp`
   3. `ks.xxx` -> `pp.xxx`
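
   The three renames above are mechanical, so downstream code that still 
imports Koalas could apply the same mapping. The sketch below is only an 
illustration of that mapping, not the tool used to prepare this PR; the 
function name `port_koalas_source` is hypothetical. It is purely textual, so it 
also rewrites matches inside strings and comments.

```python
import re

# Each rule mirrors one of the renames listed above. Order matters: the full
# import line is handled before the bare module path it contains.
_RULES = [
    # 2. the conventional import line, including its alias
    (re.compile(r"\bfrom\s+databricks\s+import\s+koalas\s+as\s+ks\b"),
     "from pyspark import pandas as pp"),
    # 1. any remaining dotted module path
    (re.compile(r"\bdatabricks\.koalas\b"), "pyspark.pandas"),
    # 3. attribute access through the old alias; the lookbehind skips
    #    identifiers that merely end in "ks" (e.g. "tasks.run()")
    (re.compile(r"(?<![\w.])ks\."), "pp."),
]


def port_koalas_source(source: str) -> str:
    """Return `source` with the Koalas names replaced by the PySpark ones."""
    for pattern, replacement in _RULES:
        source = pattern.sub(replacement, source)
    return source
```

   For example, `port_koalas_source("kdf = ks.DataFrame({'A': [1]})")` returns 
`"kdf = pp.DataFrame({'A': [1]})"`.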
   
   When this PR is merged, all the features previously available in 
[Koalas](https://github.com/databricks/koalas) will be available in PySpark as 
well.
   
   Users can access the pandas API in PySpark as shown below:
   
   ```python
   >>> from pyspark import pandas as pp
   >>> ppdf = pp.DataFrame({"A": [1, 2, 3], "B": [15, 20, 25]})
   >>> ppdf
      A   B
   0  1  15
   1  2  20
   2  3  25
   ```
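
   Code that has to run both with this port and with the standalone Koalas 
package could pick whichever module is importable. This is a minimal sketch 
under that assumption and is not part of the PR; the helper name 
`import_first_available` is hypothetical.

```python
import importlib
from types import ModuleType
from typing import Sequence


def import_first_available(candidates: Sequence[str]) -> ModuleType:
    """Import and return the first module in `candidates` that can be loaded."""
    last_error = None
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            # Remember the failure and try the next candidate.
            last_error = exc
    raise ImportError(
        f"none of {list(candidates)} could be imported"
    ) from last_error
```

   A caller would then write 
`pp = import_first_available(["pyspark.pandas", "databricks.koalas"])` and use 
`pp.DataFrame(...)` as in the example above.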
   
   The existing "options and settings" in Koalas are also available in the same 
way:
   
   ```python
   >>> from pyspark import pandas as pp
   >>> from pyspark.pandas.config import set_option, reset_option, get_option
   >>> ppser1 = pp.Series([1, 2, 3])
   >>> ppser2 = pp.Series([3, 4, 5])
   >>> ppser1 + ppser2
   Traceback (most recent call last):
   ...
   ValueError: Cannot combine the series or dataframe because it comes from a 
different dataframe. In order to allow this operation, enable 
'compute.ops_on_diff_frames' option.
   
   >>> set_option("compute.ops_on_diff_frames", True)
   >>> ppser1 + ppser2
   0    4
   1    6
   2    8
   dtype: int64
   ```
   
   Please also refer to the Koalas [API 
Reference](https://koalas.readthedocs.io/en/latest/reference/index.html) and 
[Options and 
Settings](https://koalas.readthedocs.io/en/latest/user_guide/options.html) for 
more details.
   
   **NOTE** that this PR intentionally ports the main code of Koalas first, 
almost as is and with minimal changes, because:
   - The Koalas project is fairly large. Bundling further changes for PySpark 
into this PR would make each individual change difficult to review. The Koalas 
developers include multiple Spark committers who will review the port, and 
keeping it minimal lets them review and drive the development more easily and 
effectively.
   - The Koalas tests and documentation require major changes to fit well with 
PySpark, whereas the main code does not.
   - The Koalas codebase is currently frozen, and we plan to work together on 
the initial porting. Porting the main code first as is unblocks the Koalas 
developers to work on other items in parallel.
   
   I will make sure to follow up on the following:
   - Remove Databricks-specific APIs such as `read_delta`, `to_delta`, and the 
MLflow-related APIs.
   - Rename Koalas to "PySpark pandas APIs" and/or "pandas-on-Spark" as 
appropriate in the documentation, and in the docstrings and comments in the 
main code.
   - Triage and remove APIs that don't make sense once Koalas is part of 
PySpark.
   
   The documentation changes will be tracked in 
[SPARK-34885](https://issues.apache.org/jira/browse/SPARK-34885), and the test 
code changes will be tracked in 
[SPARK-34886](https://issues.apache.org/jira/browse/SPARK-34886).
   
   ### Why are the changes needed?
   
   Please refer to:
   - [[DISCUSS] Support pandas API layer on 
PySpark](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html)
   - [[VOTE] SPIP: Support pandas API layer on 
PySpark](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html), 
which has passed.
   
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, users can now use the pandas APIs on Spark.
   
   
   ### How was this patch tested?
   
   Manually tested for exposed major APIs and options as described above.


