HyukjinKwon commented on code in PR #37564:
URL: https://github.com/apache/spark/pull/37564#discussion_r956853668
##########
python/pyspark/pandas/frame.py:
##########
@@ -411,56 +419,158 @@ class DataFrame(Frame, Generic[T]):
Constructing DataFrame from numpy ndarray:
- >>> df2 = ps.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
- ... columns=['a', 'b', 'c', 'd', 'e'])
- >>> df2 # doctest: +SKIP
+ >>> import numpy as np
+ >>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
+ ... columns=['a', 'b', 'c', 'd', 'e'])
+ a b c d e
+ 0 1 2 3 4 5
+ 1 6 7 8 9 0
+
+ Constructing DataFrame from numpy ndarray with Pandas index:
+
+ >>> import numpy as np
+ >>> import pandas as pd
+
+ >>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
+ ... index=pd.Index([1, 4]), columns=['a', 'b', 'c', 'd', 'e'])
a b c d e
- 0 3 1 4 9 8
- 1 4 8 4 8 4
- 2 7 6 5 6 7
- 3 8 7 9 1 0
- 4 2 5 4 3 9
+ 1 1 2 3 4 5
+ 4 6 7 8 9 0
+
+ Constructing DataFrame from numpy ndarray with pandas-on-Spark index:
Review Comment:
```suggestion
Constructing DataFrame from NumPy ndarray with pandas-on-Spark index:
```
##########
python/pyspark/pandas/frame.py:
##########
@@ -375,6 +373,16 @@ class DataFrame(Frame, Generic[T]):
copy : boolean, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input
+ .. versionchanged:: 3.4.0
+ Since 3.4.0, it deals with `data` and `index` in this approach:
+ 1, when `data` is a distributed dataset (Internal DataFrame/Spark DataFrame/pandas-on-Spark DataFrame/pandas-on-Spark Series), it will first parallize the `index` if necessary, and then try to combine the `data` and `index`;
+ Note that in this case `compute.ops_on_diff_frames` should be turned on;
+ 2, when `data` is a local dataset (Pandas DataFrame/numpy ndarray/list/etc), it will first collect the `index` to driver if necessary, and then apply the `Pandas.DataFrame(...)` creation internally;
Review Comment:
```suggestion
2. when `data` is a local dataset (pandas DataFrame, NumPy ndarray, list, etc.),
it will first collect the `index` to driver if necessary, and then apply
the `pandas.DataFrame(...)` creation internally;
```
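For reference, the local-dataset path described in this docstring can be sketched with plain pandas and NumPy. This is an illustration of the equivalent behavior the note describes (collecting the index to the driver and then delegating to the `pandas.DataFrame(...)` constructor), not the pandas-on-Spark implementation itself:

```python
import numpy as np
import pandas as pd

# Local data plus an explicit index: per the docstring note, pandas-on-Spark
# collects the index (if needed) and then builds a plain pandas DataFrame,
# so the result matches this pure-pandas construction.
data = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]])
df = pd.DataFrame(data=data, index=pd.Index([1, 4]),
                  columns=['a', 'b', 'c', 'd', 'e'])
print(df)
#    a  b  c  d  e
# 1  1  2  3  4  5
# 4  6  7  8  9  0
```

With a Spark session available, `ps.DataFrame(...)` called on the same arguments should produce the same rows and index, since local input takes this pandas-backed path.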
##########
python/pyspark/pandas/frame.py:
##########
@@ -359,11 +359,9 @@ class DataFrame(Frame, Generic[T]):
Parameters
----------
- data : numpy ndarray (structured or homogeneous), dict, pandas DataFrame, Spark DataFrame \
- or pandas-on-Spark Series
+ data : numpy ndarray (structured or homogeneous), dict, pandas DataFrame,
Review Comment:
```suggestion
data : NumPy ndarray (structured or homogeneous), dict, pandas DataFrame,
```
##########
python/pyspark/pandas/frame.py:
##########
@@ -375,6 +373,16 @@ class DataFrame(Frame, Generic[T]):
copy : boolean, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input
+ .. versionchanged:: 3.4.0
+ Since 3.4.0, it deals with `data` and `index` in this approach:
+ 1, when `data` is a distributed dataset (Internal DataFrame/Spark DataFrame/pandas-on-Spark DataFrame/pandas-on-Spark Series), it will first parallize the `index` if necessary, and then try to combine the `data` and `index`;
+ Note that in this case `compute.ops_on_diff_frames` should be turned on;
Review Comment:
```suggestion
1. when `data` is a distributed dataset (internal DataFrame, PySpark DataFrame,
pandas-on-Spark DataFrame, and pandas-on-Spark Series), it will first parallelize
the `index` if necessary, and then try to combine the `data` and `index`;
Note that in this case `compute.ops_on_diff_frames` should be turned on;
```
##########
python/pyspark/pandas/frame.py:
##########
@@ -411,56 +419,158 @@ class DataFrame(Frame, Generic[T]):
Constructing DataFrame from numpy ndarray:
- >>> df2 = ps.DataFrame(np.random.randint(low=0, high=10, size=(5, 5)),
- ... columns=['a', 'b', 'c', 'd', 'e'])
- >>> df2 # doctest: +SKIP
+ >>> import numpy as np
+ >>> ps.DataFrame(data=np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 0]]),
+ ... columns=['a', 'b', 'c', 'd', 'e'])
+ a b c d e
+ 0 1 2 3 4 5
+ 1 6 7 8 9 0
+
+ Constructing DataFrame from numpy ndarray with Pandas index:
Review Comment:
```suggestion
Constructing DataFrame from NumPy ndarray with pandas index:
```
##########
python/pyspark/pandas/frame.py:
##########
@@ -359,11 +359,9 @@ class DataFrame(Frame, Generic[T]):
Parameters
----------
- data : numpy ndarray (structured or homogeneous), dict, pandas DataFrame, Spark DataFrame \
- or pandas-on-Spark Series
+ data : numpy ndarray (structured or homogeneous), dict, pandas DataFrame,
+ Spark DataFrame, pandas-on-Spark DataFrame or pandas-on-Spark Series.
Review Comment:
```suggestion
PySpark DataFrame, pandas-on-Spark DataFrame or pandas-on-Spark Series.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]