wangshuo128 commented on a change in pull request #31968:
URL: https://github.com/apache/spark/pull/31968#discussion_r608490932



##########
File path: sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLDriver.scala
##########
@@ -64,9 +65,15 @@ private[hive] class SparkSQLDriver(val context: SQLContext = SparkSQLEnv.sqlCont
         new VariableSubstitution().substitute(command)
       }
       context.sparkContext.setJobDescription(substitutorCommand)
-      val execution = context.sessionState.executePlan(context.sql(command).logicalPlan)
-      hiveResponse = SQLExecution.withNewExecutionId(execution) {

Review comment:
       When running `sql("show tables").collect`, two identical SQL queries appear in the SQL UI: execute `ShowTablesCommand`. This differs from running "show tables" in spark-sql.
   
   The reason is that:
   1. Running "show tables" in spark-sql:
   We create a DataFrame from the "show tables" SQL; its `logicalPlan` is a `LocalRelation`. Then we create a new `QueryExecution` from the `LocalRelation` and convert it to a Hive result string, see `SparkSQLDriver.run`:
   https://github.com/apache/spark/blob/06c09a79b371c5ac3e4ebad1118ed94b460f48d1/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLDriver.scala#L67-L70
   The query plan just collects from the `LocalRelation`.
   
   2. Running `sql("show tables").collect`:
   When `collect` is called, the `Dataset` directly uses its own `QueryExecution`, see
   https://github.com/apache/spark/blob/06c09a79b371c5ac3e4ebad1118ed94b460f48d1/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L2983
   The query plan actually executes the `ShowTablesCommand` again.
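   To make the difference between the two paths concrete, here is a self-contained toy model in plain Scala (no Spark dependency; `SQLUI`, `ToyQueryExecution`, and `ToyDataset` are illustrative stand-ins, not the real Spark classes):

```scala
// Toy "SQL UI": records every query execution that gets tracked.
object SQLUI {
  var executions: List[String] = Nil
  def register(plan: String): Unit = executions = executions :+ plan
}

// Stand-in for QueryExecution: executing it shows up in the UI.
case class ToyQueryExecution(plan: String) {
  def execute(): Unit = SQLUI.register(plan)
}

// Stand-in for sql("show tables"): the command runs eagerly once,
// and the Dataset's logicalPlan becomes a LocalRelation over the result.
class ToyDataset(command: String) {
  private val eager = ToyQueryExecution(s"execute $command")
  eager.execute()                       // eager command execution
  val logicalPlan: String = "LocalRelation"
  def collect(): Unit = eager.execute() // reuses the command plan
}

// Path 1 (spark-sql): SparkSQLDriver.run builds a *new* QueryExecution
// from the LocalRelation, so the second execution is only a local scan.
def driverRun(command: String): Unit = {
  val df = new ToyDataset(command)
  ToyQueryExecution(df.logicalPlan).execute()
}

// Path 2 (sql(...).collect): Dataset.collect reuses the Dataset's own
// QueryExecution, so the same command shows up in the UI twice.
def apiRun(command: String): Unit = new ToyDataset(command).collect()
```

   In this model, `driverRun` registers "execute ShowTablesCommand" once plus a `LocalRelation` scan, while `apiRun` registers the same command execution twice, matching the duplicated entries seen in the SQL UI.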
   
   So it seems that we can't solve the problem by checking `LocalRelation.fromCommand` in `SQLExecution.withNewExecutionId`.
   
   I'm thinking of another way:
   1. Unify the behavior of `SparkSQLDriver.run` and `Dataset.collect`: don't create an extra `QueryExecution` in `SparkSQLDriver.run` (what was the original purpose of creating a new `QueryExecution`?)
   ```
   val execution = context.sql(command).queryExecution
   ```
   2. Add an `isCommand` flag to `Dataset`.
   3. Pass the `Dataset` to `SQLExecution.withNewExecutionId` and check `Dataset.isCommand`.
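   A rough sketch of steps 2 and 3 in plain Scala (no Spark dependency; `ToySQLExecution` and `CommandAwareDataset` are hypothetical stand-ins for `SQLExecution` and `Dataset`, and `isCommand` is the proposed new flag, not an existing API):

```scala
// Toy model of the proposal: withNewExecutionId receives the Dataset and
// skips UI tracking for command-backed Datasets, which were already
// tracked once when the command executed eagerly.
object ToySQLExecution {
  var tracked: List[String] = Nil
  def withNewExecutionId[T](ds: CommandAwareDataset)(body: => T): T = {
    if (!ds.isCommand) tracked = tracked :+ ds.plan
    body
  }
}

// Stand-in for Dataset carrying the proposed isCommand flag.
class CommandAwareDataset(val plan: String, val isCommand: Boolean) {
  def collect(): String =
    ToySQLExecution.withNewExecutionId(this) { s"rows of $plan" }
}
```

   With this shape, a command-backed Dataset still returns its rows from `collect`, but no second execution is registered, so only the eager command execution would appear in the SQL UI.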
   
   WDYT?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


