[GitHub] [spark] zhengruifeng commented on a diff in pull request #37564: [SPARK-40135][PS] Support `data` mixed with `index` in DataFrame creation

GitBox Fri, 26 Aug 2022 20:21:07 -0700


zhengruifeng commented on code in PR #37564:
URL: https://github.com/apache/spark/pull/37564#discussion_r956530408



##########
python/pyspark/pandas/frame.py:
##########
@@ -375,6 +373,16 @@ class DataFrame(Frame, Generic[T]):
     copy : boolean, default False
         Copy data from inputs. Only affects DataFrame / 2d ndarray input
 
+    .. versionchanged:: 3.4.0
+    Since 3.4.0, it deals with `data` and `index` in this approach:
+    1, when `data` is a distributed dataset (Internal DataFrame/Spark DataFrame/
+    pandas-on-Spark DataFrame/pandas-on-Spark Series), it will first parallelize
+    the `index` if necessary, and then try to combine the `data` and `index`;
+    Note that in this case `compute.ops_on_diff_frames` should be turned on;
+    2, when `data` is a local dataset (Pandas DataFrame/numpy ndarray/list/etc),
+    it will first collect the `index` to driver if necessary, and then apply
+    the `pandas.DataFrame(...)` creation internally;

Review Comment:
   done
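
For readers of the docstring, case 2 (local `data`) can be illustrated with plain pandas, since the note says that path applies the pandas `DataFrame(...)` creation internally. This is only a sketch under that assumption: it uses pandas alone, not pandas-on-Spark, and the variable names are illustrative.

```python
import pandas as pd

# Local data without its own index (a dict of lists):
# the given `index` simply labels the rows.
from_list = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 20, 30])

# Local data that already carries an index (a pandas DataFrame):
# passing `index` selects rows by label (a reindex), so labels that are
# missing from the original data produce NaN instead of relabeling rows.
pdf = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
reindexed = pd.DataFrame(pdf, index=[1, 2, 3])

print(from_list)
print(reindexed)
```

For case 1 (distributed `data`), the note above says `compute.ops_on_diff_frames` should be turned on; in pandas-on-Spark that option is set with `ps.set_option("compute.ops_on_diff_frames", True)`.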


