ianmcook commented on code in PR #46529:
URL: https://github.com/apache/spark/pull/46529#discussion_r1610552133


##########
examples/src/main/python/sql/arrow.py:
##########
@@ -33,20 +33,23 @@
 require_minimum_pyarrow_version()
 
 
-def dataframe_to_arrow_table_example(spark: SparkSession) -> None:
-    import pyarrow as pa  # noqa: F401
-    from pyspark.sql.functions import rand
+def dataframe_to_from_arrow_table_example(spark: SparkSession) -> None:
+    import pyarrow as pa
+    import numpy as np
+
+    # Create a PyArrow Table
+    table = pa.table([pa.array(np.random.rand(100)) for i in range(3)], names=["a", "b", "c"])
 
-    # Create a Spark DataFrame
-    df = spark.range(100).drop("id").withColumns({"0": rand(), "1": rand(), 
"2": rand()})
+    # Create a Spark DataFrame from the PyArrow Table
+    df = spark.createDataFrame(table)
 
     # Convert the Spark DataFrame to a PyArrow Table
-    table = df.select("*").toArrow()
+    result_table = df.select("*").toArrow()

Review Comment:
   I suspect the original purpose of the `.select("*")` was to represent some arbitrary transformations being lazily performed on the DataFrame. That way, users will know that `toArrow()` also works when transformations have been applied.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

