risinga commented on a change in pull request #34359:
URL: https://github.com/apache/spark/pull/34359#discussion_r741859708
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
##########
@@ -511,6 +511,20 @@ class SparkSession private(
createDataset(data.asScala.toSeq)
}
+  /**
+   * Creates a [[Dataset]] from an RDD of spark.sql.catalyst.InternalRow. This method allows
+   * the caller to create the InternalRow set externally, as well as define the schema externally.
+   *
+   * @since 3.3.0
+   */
+  def createDataset(data: RDD[InternalRow], schema: StructType): DataFrame = {
Review comment:
@HyukjinKwon - should I go ahead and remove this method for the time being, so we can move forward with this PR? Or is there an opportunity to improve it elsewhere? Please let me know your thoughts.
Happy to brainstorm further!
##########
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
##########
@@ -511,6 +511,20 @@ class SparkSession private(
createDataset(data.asScala.toSeq)
}
+  /**
+   * Creates a [[Dataset]] from an RDD of spark.sql.catalyst.InternalRow. This method allows
+   * the caller to create the InternalRow set externally, as well as define the schema externally.
+   *
+   * @since 3.3.0
+   */
+  def createDataset(data: RDD[InternalRow], schema: StructType): DataFrame = {
Review comment:
Hello @HyukjinKwon, I'm trying to connect the dots and map your suggestion to the stated goals.
I agree that internal rows are, by design, internal, and are generated from the conversion between the external RDD types and the internal types. The main intentions of this change, as mentioned in the PR description, are:
1 - Allow additional flexibility when creating a typed Dataset and the corresponding rows outside Spark. The first change, on StructType, helps because it avoids unnecessary loops when building the rows.
2 - Improve overall performance by building Spark rows externally. By passing InternalRows at dataset creation, we bypass the conversion to Catalyst rows and improve the overall performance of the driver.
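To make point 2 concrete, here is a hypothetical usage sketch of the proposed method. The schema, values, and app name are made up for illustration; the only thing taken from the PR is the createDataset(RDD[InternalRow], schema) signature shown in the diff:
```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericInternalRow
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import org.apache.spark.unsafe.types.UTF8String

val spark = SparkSession.builder().appName("internal-rows-sketch").getOrCreate()

// The schema is defined externally, alongside the rows.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

// The rows are built directly as InternalRows holding Catalyst types
// (e.g. UTF8String for strings), so no Row -> InternalRow conversion step is needed.
val internalRows: RDD[InternalRow] = spark.sparkContext.parallelize(Seq[InternalRow](
  new GenericInternalRow(Array[Any](1, UTF8String.fromString("alpha"))),
  new GenericInternalRow(Array[Any](2, UTF8String.fromString("beta")))))

// The method proposed in this PR.
val df = spark.createDataset(internalRows, schema)
```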
So, assuming this is not a path we want to pursue - creating InternalRows externally - we still need to create a DataFrame based on the exposed Row type (at least, that is how we need to do it in our project).
For that we can just use the existing method createDataFrame(rowRDD: RDD[Row], schema: StructType): DataFrame.
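For comparison, a minimal sketch of that existing path (again with made-up schema and values): the rows are built with the public Row type, and Spark applies the Row -> Catalyst conversion under the hood, which is the overhead the proposed method would avoid.
```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("external-rows-sketch").getOrCreate()

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

// Rows are built externally with the public Row type...
val rowRdd = spark.sparkContext.parallelize(Seq(
  Row(1, "alpha"),
  Row(2, "beta")))

// ...and Spark converts them to InternalRows as part of DataFrame creation.
val df = spark.createDataFrame(rowRdd, schema)
df.show()
```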
Your suggestion above assumes there is already a DataFrame from which we would retrieve the Spark plan with the internal rows. What exactly is the intention of creating a DF from another DF if there is no change to the data or the schema? Apologies if the answer is obvious and I just haven't gotten there yet.
I'm a Spark contributor newbie :)