[GitHub] [spark] HyukjinKwon opened a new pull request #33954: [SPARK-36709][PYTHON] Support new syntax for specifying index type and name in pandas API on Spark

GitBox Fri, 10 Sep 2021 00:46:07 -0700


HyukjinKwon opened a new pull request #33954:
URL: https://github.com/apache/spark/pull/33954



   ### What changes were proposed in this pull request?
   
   This PR proposes new syntax to specify the index type and name in pandas API 
on Spark. This is a base work for SPARK-36707.
   
   More specifically, users now can use the type hints when typing as below:
   
   ```
   pd.DataFrame[int, [int, int]]
   pd.DataFrame[pdf.index.dtype, pdf.dtypes]
   pd.DataFrame[("index", int), [("id", int), ("A", int)]]
   pd.DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)]
   ```
   
   Note that the types of `[("id", int), ("A", int)]` or  `("index", int)` are 
matched to how you provide a compound NumPy type (see also 
https://numpy.org/doc/stable/user/basics.rec.html#introduction).
   
   Therefore, the syntax will be:
   
   **Without index:**
   
   ```
   pd.DataFrame[type, type, ...]
   pd.DataFrame[name: type, name: type, ...]
   pd.DataFrame[dtypes instance]
   pd.DataFrame[zip(names, types)]
   ```
   
   **With index:**
   
   ```
   pd.DataFrame[index_type, [type, ...]]
   pd.DataFrame[(index_name, index_type), [(name, type), ...]]
   pd.DataFrame[dtype instance, dtypes instance]
   pd.DataFrame[(index_name, index_type), zip(names, types)]
   ```
   
   ### Why are the changes needed?
   
   Currently, there is no way to specify the type hint for index type - the 
type hints are converted to return type of pandas UDFs internally. Therefore, 
we always attach default index which degrade performance:
   
   ```python
   >>> def transform(pdf) -> pd.DataFrame[int, int]:
   ...     pdf['A'] = pdf.id + 1
   ...     return pdf
   ...
   >>> ks.range(5).koalas.apply_batch(transform)
   ```
   
   ```
      c0  c1
   0   0   1
   1   1   2
   2   2   3
   3   3   4
   4   4   5
   ```
   
   The [default 
index](https://koalas.readthedocs.io/en/latest/user_guide/options.html#default-index-type)
 (for the first column that looks unnamed) is attached when the type hint is 
specified. For better performance, we should have a way to work around, see 
also [Specify the index column in conversion from Spark DataFrame to Koalas 
DataFrame](https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#specify-the-index-column-in-conversion-from-spark-dataframe-to-koalas-dataframe).
   
   Note that this still remains as experimental because Python itself yet 
doesn't support such kind of typing out of the box. Once pandas completes 
typing support like NumPy did in `numpy.typing`, we should implement Koalas 
typing package, and migrate to it with leveraging pandas' typing way.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No, this PR does not yet cause user-facing behavior.
   
   ### How was this patch tested?
   
   Unittests were added.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] HyukjinKwon opened a new pull request #33954: [SPARK-36709][PYTHON] Support new syntax for specifying index type and name in pandas API on Spark

Reply via email to