HyukjinKwon opened a new pull request #33954:
URL: https://github.com/apache/spark/pull/33954
### What changes were proposed in this pull request?
This PR proposes new syntax to specify the index type and name in pandas API
on Spark. This is a base work for SPARK-36707.
More specifically, users now can use the type hints when typing as below:
```
pd.DataFrame[int, [int, int]]
pd.DataFrame[pdf.index.dtype, pdf.dtypes]
pd.DataFrame[("index", int), [("id", int), ("A", int)]]
pd.DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)]
```
Note that the types of `[("id", int), ("A", int)]` or `("index", int)` are
matched to how you provide a compound NumPy type (see also
https://numpy.org/doc/stable/user/basics.rec.html#introduction).
Therefore, the syntax will be:
**Without index:**
```
pd.DataFrame[type, type, ...]
pd.DataFrame[name: type, name: type, ...]
pd.DataFrame[dtypes instance]
pd.DataFrame[zip(names, types)]
```
**With index:**
```
pd.DataFrame[index_type, [type, ...]]
pd.DataFrame[(index_name, index_type), [(name, type), ...]]
pd.DataFrame[dtype instance, dtypes instance]
pd.DataFrame[(index_name, index_type), zip(names, types)]
```
### Why are the changes needed?
Currently, there is no way to specify the type hint for index type - the
type hints are converted to return type of pandas UDFs internally. Therefore,
we always attach default index which degrade performance:
```python
>>> def transform(pdf) -> pd.DataFrame[int, int]:
... pdf['A'] = pdf.id + 1
... return pdf
...
>>> ks.range(5).koalas.apply_batch(transform)
```
```
c0 c1
0 0 1
1 1 2
2 2 3
3 3 4
4 4 5
```
The [default
index](https://koalas.readthedocs.io/en/latest/user_guide/options.html#default-index-type)
(for the first column that looks unnamed) is attached when the type hint is
specified. For better performance, we should have a way to work around, see
also [Specify the index column in conversion from Spark DataFrame to Koalas
DataFrame](https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#specify-the-index-column-in-conversion-from-spark-dataframe-to-koalas-dataframe).
Note that this still remains as experimental because Python itself yet
doesn't support such kind of typing out of the box. Once pandas completes
typing support like NumPy did in `numpy.typing`, we should implement Koalas
typing package, and migrate to it with leveraging pandas' typing way.
### Does this PR introduce _any_ user-facing change?
No, this PR does not yet cause user-facing behavior.
### How was this patch tested?
Unittests were added.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]