[
https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
angerszhu resolved SPARK-29018.
-------------------------------
Resolution: Won't Fix
> Build spark thrift server on its own code based on protocol v11
> ----------------------------------------------------------------
>
> Key: SPARK-29018
> URL: https://issues.apache.org/jira/browse/SPARK-29018
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: angerszhu
> Priority: Major
>
> h2. Background
> With the development of Spark and Hive, the current sql/hive-thriftserver
> module requires a lot of work to resolve code conflicts across the different
> built-in Hive versions. This is tedious, never-ending work, and it limits our
> ability to develop new features for Spark’s thrift server.
> We propose to implement a new thrift server and JDBC driver based on Hive’s
> latest v11 TCLIService.thrift protocol. The new thrift server will have the
> following features:
> # Build a new module, spark-service, as Spark’s thrift server
> # Avoid the heavy reflection and inherited code required by the
> `hive-thriftserver` module
> # Support all functionality the current `sql/hive-thriftserver` supports
> # Keep all code maintained by Spark itself, with no dependency on Hive
> # Support existing functionality in Spark’s own way, no longer limited by
> Hive’s code
> # Support running with or without a Hive metastore
> # Support user impersonation for multi-tenant use by splitting Hive
> authentication and DFS authentication
> # Support session hooks with Spark’s own code
> # Add a new JDBC driver, spark-jdbc, with Spark’s own connection URL
> “jdbc:spark://<host>:<port>/<db>”
> # Support both hive-jdbc and spark-jdbc clients, so most clients and BI
> platforms can connect
> h2. How to start?
> The new thrift server can be started with *sbin/start-spark-thriftserver.sh*
> and stopped with *sbin/stop-spark-thriftserver.sh*. HiveConf’s configurations
> are no longer needed to determine the behavior of the Spark thrift server:
> all required configuration is implemented by Spark itself in
> `org.apache.spark.sql.service.internal.ServiceConf`, and hive-site.xml is
> only used to connect to the Hive metastore. All needed settings can be
> written in *conf/spark-defaults.conf* or passed on the startup command with
> *--conf*.
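> An illustrative startup sequence follows; the script names come from this
> proposal, while the conf keys shown are ordinary Spark settings picked only
> as examples:
> {code}
> # conf/spark-defaults.conf : any ServiceConf / Spark settings can live here
> spark.sql.shuffle.partitions    200
>
> # start the new thrift server; settings can also be passed on the command line
> sbin/start-spark-thriftserver.sh --conf spark.executor.memory=4g
>
> # stop it
> sbin/stop-spark-thriftserver.sh
> {code}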
> h2. How to connect through jdbc?
> Both hive-jdbc and spark-jdbc are now supported; users can choose whichever
> they prefer.
> h3. spark-jdbc
> # Use `SparkDriver` as the JDBC driver class
> # Connection URL
> `jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`,
> mostly the same as Hive’s but with Spark’s own URL prefix `jdbc:spark`
> # For proxying, SparkDriver users should set the proxy conf
> `spark.sql.thriftserver.proxy.user=username` (a connection sketch follows
> below)
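> A minimal connection sketch for spark-jdbc; the fully qualified driver class
> name used here is an assumption (only the class name SparkDriver is given in
> this proposal), and hosts, credentials, and the placement of the proxy conf
> in the URL are placeholders:
> {code:scala}
> import java.sql.DriverManager
>
> // Load the proposed driver class (package name assumed for illustration).
> Class.forName("org.apache.spark.sql.jdbc.SparkDriver")
>
> // Spark's own URL prefix; the proxy conf is passed as a session variable here.
> val url = "jdbc:spark://host1:10000,host2:10000/default" +
>   ";spark.sql.thriftserver.proxy.user=proxy_user_name"
> val conn = DriverManager.getConnection(url, "user", "password")
>
> val rs = conn.createStatement().executeQuery("SELECT 1")
> while (rs.next()) println(rs.getInt(1))
> conn.close()
> {code}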
> h3. hive-jdbc
> # Use `HiveDriver` as the JDBC driver class
> # Connection URL
> jdbc:hive2://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list,
> same as before
> # For proxying, HiveDriver users should set the proxy conf
> hive.server2.proxy.user=username; the current server supports both proxy
> configs (a sketch follows below)
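> A comparable sketch for hive-jdbc, using the existing
> org.apache.hive.jdbc.HiveDriver; hosts and credentials are placeholders:
> {code:scala}
> import java.sql.DriverManager
>
> // The standard Hive JDBC driver class.
> Class.forName("org.apache.hive.jdbc.HiveDriver")
>
> // Same URL shape as before; the proxy user goes through the usual Hive conf.
> val url = "jdbc:hive2://host1:10000,host2:10000/default" +
>   ";hive.server2.proxy.user=proxy_user_name"
> val conn = DriverManager.getConnection(url, "user", "password")
>
> val rs = conn.createStatement().executeQuery("SELECT current_database()")
> while (rs.next()) println(rs.getString(1))
> conn.close()
> {code}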
> h2. How is it done today, and what are the limits of current practice?
> h3. Current practice
> The two modules `spark-service` and `spark-jdbc` are complete. They run well,
> we have ported the original unit tests to both modules, and all of them pass.
> For impersonation, we have written the code and tested it in our Kerberized
> environment; it works well and is waiting for review. We will now raise PRs
> against the apache/spark master branch step by step.
> h3. Here are some known changes:
> # No Hive code is used in the `spark-service` and `spark-jdbc` modules
> # In the new service, the default rc-file suffix `.hiverc` is replaced by
> `.sparkrc`
> # When SparkDriver is used as the JDBC driver class, the URL should be
> jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list
> # When SparkDriver is used as the JDBC driver class, the proxy conf should be
> `spark.sql.thriftserver.proxy.user=proxy_user_name`
> # `hiveconf` and `hivevar` session confs are supported through hive-jdbc
> connections (an illustrative URL follows this list)
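> For the last point, an illustrative hive-jdbc URL showing where the
> `hiveconf`-style conf_list and the `hivevar`-style var_list go; the
> particular keys and values are placeholders:
> {code:scala}
> // sess_var_list comes after ';', conf_list after '?', var_list after '#'
> val url = "jdbc:hive2://host1:10000/default" +
>   ";hive.server2.proxy.user=proxy_user_name" + // session variables
>   "?spark.sql.shuffle.partitions=64" +         // hiveconf-style conf_list
>   "#dt=2019-09-09"                             // hivevar-style var_list
> {code}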
> h2. What are the risks?
> This is a totally new module and does not change other modules’ code except
> to support impersonation. Apart from impersonation, we have added many unit
> tests adapted (adjusting the grammar to run without Hive) from the original
> ones, and all of them pass. Impersonation has been tested in our Kerberized
> environment, but it still needs a detailed review since it changes a lot.
> h2. How long will it take?
> We have finished all of this work in our own repo; we now plan to merge the
> code into master step by step.
> # Phase 1: PR to build the new module *spark-service* under folder *sql/service*
> # Phase 2: PR with the thrift protocol and the generated thrift protocol Java code
> # Phase 3: PR with all *spark-service* module code, a description of the
> design, and unit tests
> # Phase 4: PR to build the new module *spark-jdbc* under folder *sql/jdbc*
> # Phase 5: PR with all *spark-jdbc* module code and unit tests
> # Phase 6: PR to support thrift server impersonation
> # Phase 7: PR to build Spark’s own beeline client *spark-beeline*
> # Phase 8: PR with Spark’s own CLI client code to support *Spark SQL CLI*, in
> a module named *spark-cli*
> h3. Appendix A. Proposed API Changes. Optional section defining API changes,
> if any. Backward and forward compatibility must be taken into account.
> Compared to the current `sql/hive-thriftserver`, the corresponding API
> changes are as follows:
>
> # Add a new class org.apache.spark.sql.service.internal.ServiceConf,
> containing all configuration needed by the Spark thrift server
> # ServiceSessionXxx replaces the original HiveSessionXxx
> # In ServiceSessionImpl, remove code Spark won’t use
> # In ServiceSessionImpl, set session conf directly on sqlConf as in
> [https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L67-L69]
> (see the sketch after this list)
> # Remove SparkSQLSessionManager and fold its logic into SessionManager
> # Implement all OperationManager logic in SparkSQLOperationManager and
> rename it to OperationManager
> # Add SQLContext to ServiceSessionImpl as a member variable; don’t pass it
> through SparkSQLOperationManager, just get it with
> parentSession.getSqlContext(); session conf is set on this sqlContext’s
> sqlConf
> # Remove HiveServer2, since we don’t need its logic
> # Remove the logic for Hive impersonation, since it isn’t useful in the Spark
> thrift server, and remove the delegationTokenStr parameter in
> ServiceSessionImplWithUGI
> [https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIService.java#L352-L353];
> we will use a new approach for Spark’s impersonation
> # Remove ThriftserverShimUtils, since we don’t need it
> # Remove SparkSQLCLIService and just use CLIService
> # Remove ReflectionUtils and ReflectedCompositeService, since we no longer
> need inheritance and reflection
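> A minimal sketch of the session-conf handling described in point 4 above,
> modeled on the linked SparkSQLSessionManager code; the method name and the
> exact treatment of the `use:database` entry are assumptions for illustration:
> {code:scala}
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.SQLContext
>
> // Apply the JDBC session's conf map directly to the session's SQLContext.
> def applySessionConf(sqlContext: SQLContext,
>                      sessionConf: java.util.Map[String, String]): Unit = {
>   if (sessionConf != null) {
>     sessionConf.asScala.foreach {
>       case ("use:database", db) => sqlContext.sql(s"USE $db") // initial database
>       case (key, value)         => sqlContext.setConf(key, value)
>     }
>   }
> }
> {code}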
--
This message was sent by Atlassian Jira
(v8.3.4#803005)