davidm-db commented on code in PR #47403:
URL: https://github.com/apache/spark/pull/47403#discussion_r1687861052
##########
sql/core/src/main/scala/org/apache/spark/sql/SparkSession.scala:
##########
@@ -650,14 +657,27 @@ class SparkSession private(
private[sql] def sql(sqlText: String, args: Array[_], tracker:
QueryPlanningTracker): DataFrame =
withActive {
val plan = tracker.measurePhase(QueryPlanningTracker.PARSING) {
- val parsedPlan = sessionState.sqlParser.parsePlan(sqlText)
- if (args.nonEmpty) {
- PosParameterizedQuery(parsedPlan,
args.map(lit(_).expr).toImmutableArraySeq)
- } else {
- parsedPlan
+ val parsedPlan = sessionState.sqlParser.parseScript(sqlText)
+ parsedPlan match {
+ case CompoundBody(Seq(singleStmtPlan: SingleStatement), label) if
args.nonEmpty =>
+ CompoundBody(Seq(SingleStatement(
+ PosParameterizedQuery(
+ singleStmtPlan.parsedPlan,
args.map(lit(_).expr).toImmutableArraySeq))), label)
+ case p =>
+ assert(args.isEmpty, "Named parameters are not supported for batch
queries")
+ p
}
}
- Dataset.ofRows(self, plan, tracker)
+
+ plan match {
+ case CompoundBody(Seq(singleStmtPlan: SingleStatement), _) =>
+ Dataset.ofRows(self, singleStmtPlan.parsedPlan, tracker)
+ case _ =>
+ // execute the plan directly if it is not a single statement
+ val lastRow = executeScript(plan).foldLeft(Array.empty[Row])((_,
next) => next)
+ val attributes = DataTypeUtils.toAttributes(lastRow.head.schema)
+ Dataset.ofRows(self, LocalRelation.fromExternalRows(attributes,
lastRow.toIndexedSeq))
Review Comment:
this is a bit of a hard topic at the moment... we decided on this approach
for preview for multiple reasons:
- this what we will do for initial version of the changes for Spark Connect
execution
- discussions around multiple topics are still open:
- multiple results API - decisions around this will affect single results
API as well
- multiple results API - from correctness perspective, all statements
(including SELECTs) need to be executed eagerly. It makes sense then to have
the same behavior with single results API as well - in this case all statements
are executed eagerly, but results for all of them except the last one are
dropped.
- there is still an open discussion whether to include `return` statement
- a ton of questions about stored procedures are still an open topic
what will probably happen down the line is:
- sql() API remains unchanged and only last DataFrame is returned (as you
suggested). Requires still a lot of work to support Connect execution, current
approach works with Connect already.
- [optional] new API to do what we are doing at the moment.
- new API for multiple results, stored procedures, execute immediate, etc.
since the last part is still an open question, we figured out that we will
do a simplest thing that works e2e in all cases and then, after we gather
initial feedback from preview, and understand better what we want to do for
stored procedures/multiple results, we should actually commit to implement all
of the API changes.
please let us know your thoughts on this.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]