risinga commented on a change in pull request #34359:
URL: https://github.com/apache/spark/pull/34359#discussion_r741859708



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
##########
@@ -511,6 +511,20 @@ class SparkSession private(
     createDataset(data.asScala.toSeq)
   }
 
+  /**
+   * Creates a [[Dataset]] from an RDD of spark.sql.catalyst.InternalRow. This method allows
+   * the caller to create the InternalRow set externally, as well as define the schema externally.
+   *
+   * @since 3.3.0
+   */
+  def createDataset(data: RDD[InternalRow], schema: StructType): DataFrame = {

Review comment:
       @HyukjinKwon - should I go ahead and remove this method for the time 
being, so we can move forward with this PR? Or is there an opportunity to 
improve elsewhere? Please let me know your thoughts.
   Happy to brainstorm further!

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala
##########
@@ -511,6 +511,20 @@ class SparkSession private(
     createDataset(data.asScala.toSeq)
   }
 
+  /**
+   * Creates a [[Dataset]] from an RDD of spark.sql.catalyst.InternalRow. This method allows
+   * the caller to create the InternalRow set externally, as well as define the schema externally.
+   *
+   * @since 3.3.0
+   */
+  def createDataset(data: RDD[InternalRow], schema: StructType): DataFrame = {

Review comment:
       Hello @HyukjinKwon, I'm trying to connect the dots and map your 
suggestion to the stated goals.
   I agree that internal rows are, in principle, internal: they are generated by 
the conversion between the external RDD types and the internal types. The main 
intention of this change, as mentioned in the PR description, is to:
   
   1 - Allow additional flexibility when creating a typed Dataset and the 
corresponding rows outside Spark. The first change, to StructType, helps because 
it avoids unnecessary loops when building the rows.
   
   2 - Improve overall performance by building the Spark rows externally. By 
passing InternalRows at dataset creation, we bypass the conversion to Catalyst 
rows and improve the overall performance of the driver.
   So, assuming creating InternalRows outside Spark is not a path we want to 
pursue, we still need to create a DataFrame based on the exposed Row type (at 
least, that is how we need to do it in our project).
   For that we can just use the existing method createDataFrame(rowRDD: 
RDD[Row], schema: StructType): DataFrame (sketched below).
   
   The above suggestion assumes there is already a DataFrame from which we would 
retrieve the Spark plan with the internal rows. What exactly is the intention of 
creating a DF from another DF if there is no change to the data or schema? 
Apologies if the answer is obvious and I just haven't gotten there yet - I'm a 
Spark contributor newbie :)
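   For reference, here is a minimal, self-contained sketch of that existing 
public path, i.e. building the Rows and the schema outside Spark and handing 
them to createDataFrame. This is not part of the PR; the master, app name, and 
column names are only placeholders for illustration.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().master("local[*]").appName("external-rows").getOrCreate()

// Schema and rows are built by the caller, outside Spark.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)))

val rowRDD: RDD[Row] = spark.sparkContext.parallelize(Seq(
  Row(1, "alice"),
  Row(2, "bob")))

// Existing public API: Spark converts every external Row into a Catalyst
// InternalRow (via a RowEncoder) before execution -- this is the conversion
// step the proposed overload is meant to skip.
val df = spark.createDataFrame(rowRDD, schema)
df.show()

// Hypothetical call site for the overload proposed in this PR, with the
// caller building InternalRows directly (e.g. GenericInternalRow + UTF8String):
// val internalRows: RDD[InternalRow] = ...
// val df2 = spark.createDataset(internalRows, schema)
```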




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


