Re: [PR] [SPARK-48344][SQL] SQL API change to support execution of compound statements [spark]

via GitHub Tue, 23 Jul 2024 06:53:08 -0700


davidm-db commented on code in PR #47403:
URL: https://github.com/apache/spark/pull/47403#discussion_r1687861052



##########
sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##########
@@ -650,14 +657,27 @@ class SparkSession private(
   private[sql] def sql(sqlText: String, args: Array[_], tracker: 
QueryPlanningTracker): DataFrame =
     withActive {
       val plan = tracker.measurePhase(QueryPlanningTracker.PARSING) {
-        val parsedPlan = sessionState.sqlParser.parsePlan(sqlText)
-        if (args.nonEmpty) {
-          PosParameterizedQuery(parsedPlan, 
args.map(lit(_).expr).toImmutableArraySeq)
-        } else {
-          parsedPlan
+        val parsedPlan = sessionState.sqlParser.parseScript(sqlText)
+        parsedPlan match {
+          case CompoundBody(Seq(singleStmtPlan: SingleStatement), label) if 
args.nonEmpty =>
+            CompoundBody(Seq(SingleStatement(
+              PosParameterizedQuery(
+                singleStmtPlan.parsedPlan, 
args.map(lit(_).expr).toImmutableArraySeq))), label)
+          case p =>
+            assert(args.isEmpty, "Named parameters are not supported for batch 
queries")
+            p
         }
       }
-      Dataset.ofRows(self, plan, tracker)
+
+      plan match {
+        case CompoundBody(Seq(singleStmtPlan: SingleStatement), _) =>
+          Dataset.ofRows(self, singleStmtPlan.parsedPlan, tracker)
+        case _ =>
+          // execute the plan directly if it is not a single statement
+          val lastRow = executeScript(plan).foldLeft(Array.empty[Row])((_, 
next) => next)
+          val attributes = DataTypeUtils.toAttributes(lastRow.head.schema)
+          Dataset.ofRows(self, LocalRelation.fromExternalRows(attributes, 
lastRow.toIndexedSeq))

Review Comment:
   this is a bit of a hard topic at the moment... we decided on this approach 
for preview for multiple reasons:
   - this what we will do for initial version of the changes for Spark Connect 
execution
   - discussions around multiple topics are still open:
     - multiple results API - decisions around this will affect single results 
API as well
     - multiple results API - from correctness perspective, all statements 
(including SELECTs) need to be executed eagerly. It makes sense then to have 
the same behavior with single results API as well - in this case all statements 
are executed eagerly, but results for all of them except the last one are 
dropped.
     - there is still an open discussion whether to include `return` statement
     - a ton of questions about stored procedures are still an open topic
   
   what will probably happen down the line is:
   - sql() API remains unchanged and only last DataFrame is returned (as you 
suggested). Requires still a lot of work to support Connect execution, current 
approach works with Connect already.
   - [optional] new API to do what we are doing at the moment.
   - new API for multiple results, stored procedures, execute immediate, etc.
   
   since the last part is still an open question, we figured out that we will 
do a simplest thing that works e2e in all cases and then, after we gather 
initial feedback from preview, and understand better what we want to do for 
stored procedures/multiple results, we should actually commit to implement all 
of the API changes.
   
   please let us know your thoughts on this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [SPARK-48344][SQL] SQL API change to support execution of compound statements [spark]

Reply via email to