HyukjinKwon commented on code in PR #37564:
URL: https://github.com/apache/spark/pull/37564#discussion_r956853668
##########
python/pyspark/pandas/frame.py:
##########
@@ -411,56 +419,158 @@ class DataFrame(Frame, Generic[T]):
Constructing DataFrame from numpy ndarray:
- >>> df2 = ps.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
- ... columns=['a', 'b', 'c', 'd', 'e'])
- >>> df2 # doctest: +SKIP
+ >>> import numpy as np
+ >>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
+ ... columns=['a', 'b', 'c', 'd', 'e'])
+ a b c d e
+ 0 1 2 3 4 5
+ 1 6 7 8 9 0
+
+ Constructing DataFrame from numpy ndarray with Pandas index:
+
+ >>> import numpy as np
+ >>> import pandas as pd
+
+ >>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
+ ... index=pd.Index([1, 4]), columns=['a', 'b', 'c', 'd', 'e'])
a b c d e
- 0 3 1 4 9 8
- 1 4 8 4 8 4
- 2 7 6 5 6 7
- 3 8 7 9 1 0
- 4 2 5 4 3 9
+ 1 1 2 3 4 5
+ 4 6 7 8 9 0
+
+ Constructing DataFrame from numpy ndarray with pandas-on-Spark index:
Review Comment:
```suggestion
Constructing DataFrame from NumPy ndarray with pandas-on-Spark index:
```
##########
python/pyspark/pandas/frame.py:
##########
@@ -375,6 +373,16 @@ class DataFrame(Frame, Generic[T]):
copy : boolean, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input
+ .. versionchanged:: 3.4.0
+ Since 3.4.0, it deals with `data` and `index` in this approach:
+ 1, when `data` is a distributed dataset (Internal DataFrame/Spark DataFrame/pandas-on-Spark DataFrame/pandas-on-Spark Series), it will first parallize the `index` if necessary, and then try to combine the `data` and `index`;
+ Note that in this case `compute.ops_on_diff_frames` should be turned on;
+ 2, when `data` is a local dataset (Pandas DataFrame/numpy ndarray/list/etc), it will first collect the `index` to driver if necessary, and then apply the `Pandas.DataFrame(...)` creation internally;
Review Comment:
```suggestion
2. when `data` is a local dataset (pandas DataFrame, NumPy ndarray, list, etc.),
it will first collect the `index` to driver if necessary, and then apply
the `pandas.DataFrame(...)` creation internally;
```
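For reference, the local-dataset path described in this docstring can be sketched with plain pandas and NumPy. This is an illustration of the equivalent behavior the note describes (collecting the index to the driver and then delegating to the `pandas.DataFrame(...)` constructor), not the pandas-on-Spark implementation itself:

```python
import numpy as np
import pandas as pd

# Local data plus an explicit index: per the docstring note, pandas-on-Spark
# collects the index (if needed) and then builds a plain pandas DataFrame,
# so the result matches this pure-pandas construction.
data = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]])
df = pd.DataFrame(data=data, index=pd.Index([1, 4]),
                  columns=['a', 'b', 'c', 'd', 'e'])
print(df)
#    a  b  c  d  e
# 1  1  2  3  4  5
# 4  6  7  8  9  0
```

With a Spark session available, `ps.DataFrame(...)` called on the same arguments should produce the same rows and index, since local input takes this pandas-backed path.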
##########
python/pyspark/pandas/frame.py:
##########
@@ -359,11 +359,9 @@ class DataFrame(Frame, Generic[T]):
Parameters
----------
- data : numpy ndarray (structured or homogeneous), dict, pandas DataFrame, Spark DataFrame \
- or pandas-on-Spark Series
+ data : numpy ndarray (structured or homogeneous), dict, pandas DataFrame,
Review Comment:
```suggestion
data : NumPy ndarray (structured or homogeneous), dict, pandas DataFrame,
```
##########
python/pyspark/pandas/frame.py:
##########
@@ -375,6 +373,16 @@ class DataFrame(Frame, Generic[T]):
copy : boolean, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input
+ .. versionchanged:: 3.4.0
+ Since 3.4.0, it deals with `data` and `index` in this approach:
+ 1, when `data` is a distributed dataset (Internal DataFrame/Spark DataFrame/pandas-on-Spark DataFrame/pandas-on-Spark Series), it will first parallize the `index` if necessary, and then try to combine the `data` and `index`;
+ Note that in this case `compute.ops_on_diff_frames` should be turned on;
Review Comment:
```suggestion
1. when `data` is a distributed dataset (internal DataFrame, PySpark DataFrame,
pandas-on-Spark DataFrame, and pandas-on-Spark Series), it will first parallelize
the `index` if necessary, and then try to combine the `data` and `index`;
Note that in this case `compute.ops_on_diff_frames` should be turned on;
```
##########
python/pyspark/pandas/frame.py:
##########
@@ -411,56 +419,158 @@ class DataFrame(Frame, Generic[T]):
Constructing DataFrame from numpy ndarray:
- >>> df2 = ps.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
- ... columns=['a', 'b', 'c', 'd', 'e'])
- >>> df2 # doctest: +SKIP
+ >>> import numpy as np
+ >>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
+ ... columns=['a', 'b', 'c', 'd', 'e'])
+ a b c d e
+ 0 1 2 3 4 5
+ 1 6 7 8 9 0
+
+ Constructing DataFrame from numpy ndarray with Pandas index:
Review Comment:
```suggestion
Constructing DataFrame from NumPy ndarray with pandas index:
```
##########
python/pyspark/pandas/frame.py:
##########
@@ -359,11 +359,9 @@ class DataFrame(Frame, Generic[T]):
Parameters
----------
- data : numpy ndarray (structured or homogeneous), dict, pandas DataFrame, Spark DataFrame \
- or pandas-on-Spark Series
+ data : numpy ndarray (structured or homogeneous), dict, pandas DataFrame,
+ Spark DataFrame, pandas-on-Spark DataFrame or pandas-on-Spark Series.
Review Comment:
```suggestion
PySpark DataFrame, pandas-on-Spark DataFrame or pandas-on-Spark Series.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]