[
https://issues.apache.org/jira/browse/SPARK-29018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
angerszhu resolved SPARK-29018.
-------------------------------
Resolution: Won't Fix
> Build spark thrift server on its own code based on protocol v11
> ----------------------------------------------------------------
>
> Key: SPARK-29018
> URL: https://issues.apache.org/jira/browse/SPARK-29018
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: angerszhu
> Priority: Major
>
> h2. Background
> With the development of Spark and Hive, the current sql/hive-thriftserver
> module requires a lot of work to resolve code conflicts across the different
> built-in Hive versions. This is tedious, never-ending work, and it limits our
> ability to develop new features for Spark’s thrift server.
> We propose to implement a new thrift server and JDBC driver based on Hive’s
> latest v11 TCLIService.thrift protocol. The new thrift server will have the
> following features:
> # Build a new module, spark-service, as Spark’s thrift server
> # Avoid the heavy reflection and inherited code required by the
> `hive-thriftserver` module
> # Support all functionality the current `sql/hive-thriftserver` supports
> # Keep all code maintained by Spark itself, with no dependency on Hive
> # Support existing functionality in Spark’s own way, no longer limited by
> Hive’s code
> # Support running with or without a Hive metastore
> # Support user impersonation for multi-tenant use by splitting Hive
> authentication and DFS authentication
> # Support session hooks with Spark’s own code
> # Add a new JDBC driver, spark-jdbc, with Spark’s own connection URL
> “jdbc:spark://<host>:<port>/<db>”
> # Support both hive-jdbc and spark-jdbc clients, so most clients and BI
> platforms can connect
> h2. How to start?
> The new thrift server can be started with *sbin/start-spark-thriftserver.sh*
> and stopped with *sbin/stop-spark-thriftserver.sh*. HiveConf’s configurations
> are no longer needed to determine the behavior of the Spark thrift server:
> all required configuration is implemented by Spark itself in
> `org.apache.spark.sql.service.internal.ServiceConf`, and hive-site.xml is
> only used to connect to the Hive metastore. All needed settings can be
> written in *conf/spark-defaults.conf* or passed on the startup command with
> *--conf*.
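> An illustrative startup sequence follows; the script names come from this
> proposal, while the conf keys shown are ordinary Spark settings picked only
> as examples:
> {code}
> # conf/spark-defaults.conf : any ServiceConf / Spark settings can live here
> spark.sql.shuffle.partitions    200
>
> # start the new thrift server; settings can also be passed on the command line
> sbin/start-spark-thriftserver.sh --conf spark.executor.memory=4g
>
> # stop it
> sbin/stop-spark-thriftserver.sh
> {code}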
> h2. How to connect through jdbc?
> Both hive-jdbc and spark-jdbc are now supported; users can choose whichever
> they prefer.
> h3. spark-jdbc
> # Use `SparkDriver` as the JDBC driver class
> # Connection URL
> `jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list`,
> mostly the same as Hive’s but with Spark’s own URL prefix `jdbc:spark`
> # For proxying, SparkDriver users should set the proxy conf
> `spark.sql.thriftserver.proxy.user=username` (a connection sketch follows
> below)
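> A minimal connection sketch for spark-jdbc; the fully qualified driver class
> name used here is an assumption (only the class name SparkDriver is given in
> this proposal), and hosts, credentials, and the placement of the proxy conf
> in the URL are placeholders:
> {code:scala}
> import java.sql.DriverManager
>
> // Load the proposed driver class (package name assumed for illustration).
> Class.forName("org.apache.spark.sql.jdbc.SparkDriver")
>
> // Spark's own URL prefix; the proxy conf is passed as a session variable here.
> val url = "jdbc:spark://host1:10000,host2:10000/default" +
>   ";spark.sql.thriftserver.proxy.user=proxy_user_name"
> val conn = DriverManager.getConnection(url, "user", "password")
>
> val rs = conn.createStatement().executeQuery("SELECT 1")
> while (rs.next()) println(rs.getInt(1))
> conn.close()
> {code}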
> h3. hive-jdbc
> # Use `HiveDriver` as the JDBC driver class
> # Connection URL
> jdbc:hive2://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list,
> same as before
> # For proxying, HiveDriver users should set the proxy conf
> hive.server2.proxy.user=username; the current server supports both proxy
> configs (a sketch follows below)
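> A comparable sketch for hive-jdbc, using the existing
> org.apache.hive.jdbc.HiveDriver; hosts and credentials are placeholders:
> {code:scala}
> import java.sql.DriverManager
>
> // The standard Hive JDBC driver class.
> Class.forName("org.apache.hive.jdbc.HiveDriver")
>
> // Same URL shape as before; the proxy user goes through the usual Hive conf.
> val url = "jdbc:hive2://host1:10000,host2:10000/default" +
>   ";hive.server2.proxy.user=proxy_user_name"
> val conn = DriverManager.getConnection(url, "user", "password")
>
> val rs = conn.createStatement().executeQuery("SELECT current_database()")
> while (rs.next()) println(rs.getString(1))
> conn.close()
> {code}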
> h2. How is it done today, and what are the limits of current practice?
> h3. Current practice
> The two modules `spark-service` and `spark-jdbc` are complete. They run well,
> we have ported the original unit tests to both modules, and all of them pass.
> For impersonation, we have written the code and tested it in our Kerberized
> environment; it works well and is waiting for review. We will now raise PRs
> against the apache/spark master branch step by step.
> h3. Here are some known changes:
> # No Hive code is used in the `spark-service` and `spark-jdbc` modules
> # In the new service, the default rc-file suffix `.hiverc` is replaced by
> `.sparkrc`
> # When SparkDriver is used as the JDBC driver class, the URL should be
> jdbc:spark://<host1>:<port1>,<host2>:<port2>/dbName;sess_var_list?conf_list#var_list
> # When SparkDriver is used as the JDBC driver class, the proxy conf should be
> `spark.sql.thriftserver.proxy.user=proxy_user_name`
> # `hiveconf` and `hivevar` session confs are supported through hive-jdbc
> connections (an illustrative URL follows this list)
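> For the last point, an illustrative hive-jdbc URL showing where the
> `hiveconf`-style conf_list and the `hivevar`-style var_list go; the
> particular keys and values are placeholders:
> {code:scala}
> // sess_var_list comes after ';', conf_list after '?', var_list after '#'
> val url = "jdbc:hive2://host1:10000/default" +
>   ";hive.server2.proxy.user=proxy_user_name" + // session variables
>   "?spark.sql.shuffle.partitions=64" +         // hiveconf-style conf_list
>   "#dt=2019-09-09"                             // hivevar-style var_list
> {code}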
> h2. What are the risks?
> This is a totally new module and does not change other modules’ code except
> to support impersonation. Apart from impersonation, we have added many unit
> tests adapted (adjusting the grammar to run without Hive) from the original
> ones, and all of them pass. Impersonation has been tested in our Kerberized
> environment, but it still needs a detailed review since it changes a lot.
> h2. How long will it take?
> We have finished all of this work in our own repo; we now plan to merge the
> code into master step by step.
> # Phase 1: PR to build the new module *spark-service* under folder *sql/service*
> # Phase 2: PR with the thrift protocol and the generated thrift protocol Java code
> # Phase 3: PR with all *spark-service* module code, a description of the
> design, and unit tests
> # Phase 4: PR to build the new module *spark-jdbc* under folder *sql/jdbc*
> # Phase 5: PR with all *spark-jdbc* module code and unit tests
> # Phase 6: PR to support thrift server impersonation
> # Phase 7: PR to build Spark’s own beeline client *spark-beeline*
> # Phase 8: PR with Spark’s own CLI client code to support *Spark SQL CLI*, in
> a module named *spark-cli*
> h3. Appendix A. Proposed API Changes. Optional section defining API changes,
> if any. Backward and forward compatibility must be taken into account.
> Compared to the current `sql/hive-thriftserver`, the corresponding API
> changes are as follows:
>
> # Add a new class org.apache.spark.sql.service.internal.ServiceConf,
> containing all configuration needed by the Spark thrift server
> # ServiceSessionXxx replaces the original HiveSessionXxx
> # In ServiceSessionImpl, remove code Spark won’t use
> # In ServiceSessionImpl, set session conf directly on sqlConf as in
> [https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLSessionManager.scala#L67-L69]
> (see the sketch after this list)
> # Remove SparkSQLSessionManager and fold its logic into SessionManager
> # Implement all OperationManager logic in SparkSQLOperationManager and
> rename it to OperationManager
> # Add SQLContext to ServiceSessionImpl as a member variable; don’t pass it
> through SparkSQLOperationManager, just get it with
> parentSession.getSqlContext(); session conf is set on this sqlContext’s
> sqlConf
> # Remove HiveServer2, since we don’t need its logic
> # Remove the logic for Hive impersonation, since it isn’t useful in the Spark
> thrift server, and remove the delegationTokenStr parameter in
> ServiceSessionImplWithUGI
> [https://github.com/apache/spark/blob/18431c7baaba72539603814ef1757650000943d5/sql/hive-thriftserver/v2.3/src/main/java/org/apache/hive/service/cli/thrift/ThriftCLIService.java#L352-L353];
> we will use a new approach for Spark’s impersonation
> # Remove ThriftserverShimUtils, since we don’t need it
> # Remove SparkSQLCLIService and just use CLIService
> # Remove ReflectionUtils and ReflectedCompositeService, since we no longer
> need inheritance and reflection
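> A minimal sketch of the session-conf handling described in point 4 above,
> modeled on the linked SparkSQLSessionManager code; the method name and the
> exact treatment of the `use:database` entry are assumptions for illustration:
> {code:scala}
> import scala.collection.JavaConverters._
> import org.apache.spark.sql.SQLContext
>
> // Apply the JDBC session's conf map directly to the session's SQLContext.
> def applySessionConf(sqlContext: SQLContext,
>                      sessionConf: java.util.Map[String, String]): Unit = {
>   if (sessionConf != null) {
>     sessionConf.asScala.foreach {
>       case ("use:database", db) => sqlContext.sql(s"USE $db") // initial database
>       case (key, value)         => sqlContext.setConf(key, value)
>     }
>   }
> }
> {code}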
--
This message was sent by Atlassian Jira
(v8.3.4#803005)