Hi Cheng Pan,

Glad to join the discussion.

The git repo you pointed out is indeed used in our internal production ETL pipeline, though it is not yet combined with Kyuubi. I plan to refactor it in two respects:

1. Now that Spark 3.3 has been released, most DSv2 functionality seems to be production ready [1], and some APIs have changed since 3.1, so upgrading to this version should be more stable.
2. We are also keen to integrate Kyuubi as our Spark SQL query engine, although that work is still at the research stage.

I have found some issues in integrating the hive-catalog extension with Kyuubi. For instance, Kyuubi sets `kyuubi.engine.single.spark.session`=false by default to isolate the contexts of concurrently executing SQL (I guess). This causes spark.newSession to be invoked for each session. The embedded sessionCatalog is shared across all Spark sessions (including the new ones), but in the DSv2 architecture the catalogManager (which holds all plugin catalogs) is created every time a SparkSession is constructed. So when queries run concurrently, the more DSv2 catalogs we use, the more overhead (mainly metaspace usage) the engine driver carries. In my test with 256m of metaspace, an OOM occurred.

Another issue is that the hive-catalog currently assumes that all target Hadoop clusters can be accessed by the Spark execution runtime; that is, a viewfs or router-based federation must be configured in advance. I am not sure whether we can configure the Hadoop conf separately for each hive catalog. Similarly, I just use the SQLConf in sessionState as the global SQLConf for all hive catalogs, so in some cases the default conf values may need to differ between catalogs.

[1] https://mp.weixin.qq.com/s/DJOIrRddCr7vYKEGGg0WyA

> On Aug 16, 2022, at 6:23 PM, Cheng Pan <pan3...@gmail.com> wrote:
>
> Thanks for your idea.
>
> It's up to the community if Kyuubi will support this feature, if anyone is
> interested in this feature, feel free to open PR for it, I'm happy to review.
>
> In fact, I found that one guy (also +he as receiver) has done (probably part
> of) the job [1], but I didn't test it, and I would appreciate if we had a
> chance to collaborate.
>
> [1] https://github.com/permanentstar/spark-sql-dsv2-extension
>
> Thanks,
> Cheng Pan
>
>
> On Aug 16, 2022 at 17:57:35, zhaomin <zhaomin1...@163.com> wrote:
>> I'm also interested in it.
>>
>> Best Regards,
>> Min Zhao
>>
>> ---- Replied Message ----
>> | From | kaifei yi <yikaif...@gmail.com> |
>> | Date | 08/16/2022 17:42 |
>> | To | dev@kyuubi.apache.org |
>> | Cc | |
>> | Subject | Support Hive V2 DataSource in Kyuubi |
>> Hi, kyuubi community:
>>
>> Currently, users are clamoring for the ability to run federated queries in
>> a Lakehouse architecture; we probably need several datasources to meet this.
>>
>> In practice, some user services need to access other hive warehouses for
>> federated queries. Currently, Apache Spark supports access to hive data
>> sources. However, in federated scenarios, some capabilities may be
>> disabled. For example, users may need to access different hive warehouses in
>> a single job to perform a federated query, and the hive versions are
>> different. This requirement can be met by a hive V2 datasource.
>>
>> Does the Kyuubi community have any idea how to include hive V2 in the
>> feature list?