[GitHub] [spark] cxzl25 commented on issue #23826: [SPARK-26914][SQL] Fix scheduler pool may be unpredictable when we only want to use default pool and do not set spark.scheduler.pool for the session

GitBox Mon, 01 Apr 2019 09:37:49 -0700

URL: https://github.com/apache/spark/pull/23826#issuecomment-478653048
 
 
   @srowen 
   It is the same problem.
   The Spark Thrift server uses a thread pool (HiveServer2-Handler-Pool) to 
accept the client's Thrift requests, and ```SparkExecuteStatementOperation``` 
submits the SQL to the ```DAGScheduler```. The job itself is then handled on a 
different thread, that of the ```DAGSchedulerEventProcessLoop```.
   The ```default``` pool is used when the thread-local ```properties``` passed 
to ```DAGScheduler.runJob``` do not set ```spark.scheduler.pool```.
   
   
https://github.com/apache/spark/blob/5888b15d9cdf8272012018f39bf58c8faf68a5e1/sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/thrift/ThriftBinaryCLIService.java#L52-L54
   
   
https://github.com/apache/spark/blob/5888b15d9cdf8272012018f39bf58c8faf68a5e1/core/src/main/scala/org/apache/spark/scheduler/SchedulableBuilder.scala#L184-L185
   
   
https://github.com/apache/spark/blob/5888b15d9cdf8272012018f39bf58c8faf68a5e1/core/src/main/scala/org/apache/spark/SparkContext.scala#L1999-L2013
   
   
https://github.com/apache/spark/blob/5888b15d9cdf8272012018f39bf58c8faf68a5e1/core/src/main/scala/org/apache/spark/SparkContext.scala#L642-L647
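
   The fallback to the ```default``` pool described above can be sketched in 
plain Java. This is a simplified stand-in, not Spark's code: the class 
```PoolLookup``` and the method ```poolFor``` are hypothetical names, and only 
the property key ```spark.scheduler.pool``` and the pool name ```default``` 
come from the discussion.

```java
import java.util.Properties;

public class PoolLookup {
    // Property key and fallback pool name as named in this thread.
    static final String POOL_PROPERTY = "spark.scheduler.pool";
    static final String DEFAULT_POOL = "default";

    // Hypothetical helper: pick the pool for a job from the properties
    // captured on the submitting thread, falling back to the default pool.
    static String poolFor(Properties jobProperties) {
        if (jobProperties == null) {
            return DEFAULT_POOL;
        }
        return jobProperties.getProperty(POOL_PROPERTY, DEFAULT_POOL);
    }

    public static void main(String[] args) {
        Properties unset = new Properties();
        Properties configured = new Properties();
        configured.setProperty(POOL_PROPERTY, "xxx");

        System.out.println(poolFor(unset));      // prints "default"
        System.out.println(poolFor(configured)); // prints "xxx"
    }
}
```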
   
   A client in the same session can issue multiple requests, which may be 
handled by different threads of the thread pool (HiveServer2-Handler-Pool). If 
the thread-local variable is not cleaned up, it may affect the next request 
handled on that thread, so it needs to be cleaned up before leaving the thread.
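
   The leak described above can be reproduced outside Spark with a minimal 
sketch. It assumes a single-threaded executor as a stand-in for 
HiveServer2-Handler-Pool and a plain ```ThreadLocal<Properties>``` as a 
stand-in for the scheduler properties; ```HandlerPoolLeak```, 
```LOCAL_PROPS``` and ```poolSeenByNextRequest``` are hypothetical names used 
only for illustration.

```java
import java.util.Properties;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class HandlerPoolLeak {
    // Hypothetical stand-in for the thread-local job properties.
    static final ThreadLocal<Properties> LOCAL_PROPS =
            ThreadLocal.withInitial(Properties::new);

    static String currentPool() {
        return LOCAL_PROPS.get().getProperty("spark.scheduler.pool", "default");
    }

    // Runs a request that sets the pool, then a second request on the same
    // pooled thread, and returns the pool the second request observes.
    static String poolSeenByNextRequest(boolean cleanUp) {
        ExecutorService handlerPool = Executors.newSingleThreadExecutor();
        try {
            Runnable firstRequest = () -> {
                LOCAL_PROPS.get().setProperty("spark.scheduler.pool", "xxx");
                try {
                    // the statement would run here
                } finally {
                    if (cleanUp) {
                        LOCAL_PROPS.remove(); // clean up before leaving the thread
                    }
                }
            };
            Callable<String> secondRequest = HandlerPoolLeak::currentPool;

            handlerPool.submit(firstRequest).get();
            return handlerPool.submit(secondRequest).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            handlerPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(poolSeenByNextRequest(false)); // prints "xxx"
        System.out.println(poolSeenByNextRequest(true));  // prints "default"
    }
}
```

   Without the cleanup, the second request inherits the pool left behind by 
the first one even though it never set a pool itself.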
   
   Because each execution in the same session may run on a different thread, 
```spark.scheduler.pool``` needs to be set on every execution so that the pool 
name the user chose through the parameter 
```spark.sql.thriftserver.scheduler.pool``` takes effect.
   
   When pool==null, the current connection has not executed 
```set spark.sql.thriftserver.scheduler.pool=xxx```.
   
   The current approach, in which the next thread cleans up the variables left 
behind by the previous thread, is a bit strange, and when 
```spark.sql.thriftServer.incrementalCollect=true``` the problem may be 
reproduced again.
   
   session a
   ```sql
   select 1;  --default pool
   set spark.sql.thriftserver.scheduler.pool=xxx;
   select  2;   --xxx pool
   select  3;   --xxx pool
   ```
   session b
   ```sql
   select 1;  --default pool
   select 2;  --default pool
   ```

