Hi everyone,

There is an issue [1] and an initial implementation PR [2] for this discussion; please go to GitHub for further discussion and review.
[1] https://github.com/apache/incubator-kyuubi/issues/3259
[2] https://github.com/apache/incubator-kyuubi/pull/3260

Thanks

On Wed, Aug 17, 2022 at 16:07, Cheng Pan <pan3...@gmail.com> wrote:

> Thanks for sharing your experience.
>
> > kyuubi default set kyuubi.engine.single.spark.session=true
>
> It's `false` by default.
>
> > … provide concurrent SQL execution in context isolation (I guess)
>
> The concept of a Spark session is similar to a JDBC/RDBMS connection.
>
> > This causes spark.newSession to be invoked for each transaction. Although the
> > embedded sessionCatalog is shared across all Spark sessions (including the new
> > one), in the DSv2 architecture the CatalogManager (which holds all plugin
> > catalogs) is created every time a SparkSession is constructed. In this case,
> > when concurrent queries fire, the more DSv2 catalogs we use, the more
> > overhead (mainly metaspace usage) the engine driver holds; in my test
> > with a 256m metaspace, OOM occurred.
>
> Yes, this is the big difference between the v1 catalog and v2; maybe we can
> introduce a cache mechanism to reduce the overhead, e.g. a hive client pool.
>
> > … a viewfs or router-based federation must be configured in advance, and I am
> > not sure if we can configure the hadoop conf separately for each hive
> > catalog.
>
> Maybe we can learn something from Iceberg.
>
> Since you already have a good shape of a Hive DSv2 catalog implementation,
> and more and more people are interested in this feature, would you like to
> contribute it to the Kyuubi project?
>
> Thanks,
> Cheng Pan
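As a rough illustration of the client-pool idea mentioned above (all names below are hypothetical, not an existing Kyuubi or Spark API), the heavyweight metastore client behind a catalog could be cached process-wide and keyed by metastore URI, so that the per-session CatalogManager instances stay cheap even under concurrent sessions:

    import java.util.concurrent.ConcurrentHashMap

    object HiveClientPool {
      // Stand-in for whatever heavyweight client a Hive DSv2 catalog wraps.
      final case class PooledHiveClient(metastoreUri: String)

      private val clients = new ConcurrentHashMap[String, PooledHiveClient]()

      // Return the cached client for this metastore, creating it at most once,
      // so N concurrent sessions x M catalogs do not allocate N * M clients.
      def getOrCreate(metastoreUri: String): PooledHiveClient =
        clients.computeIfAbsent(metastoreUri, uri => PooledHiveClient(uri))
    }

In such a sketch, each session's catalog plugin would call HiveClientPool.getOrCreate from its initialize() instead of constructing a fresh client, which is roughly what the "hive client pool" suggestion aims at.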
> On Aug 17, 2022 at 11:23:07, Heng Su <permanent.s...@gmail.com> wrote:
>
> > Hi, Cheng Pan
> >
> > Glad to join the session.
> >
> > The git repo you pointed out is indeed used in our internal production ETL
> > pipeline, though currently not combined with Kyuubi.
> >
> > But I plan to refactor it in two aspects:
> >
> > 1. As Spark 3.3 has been released, most DSv2 functionality seems to be
> > production ready [1], and some APIs have changed since 3.1, so upgrading to
> > this version should be more stable.
> > 2. We also have a strong will to integrate Kyuubi as the Spark SQL query
> > engine, while currently this work is still in research.
> >
> > I have found some issues when integrating the hive-catalog extension with
> > Kyuubi. For instance, Kyuubi sets
> > `kyuubi.engine.single.spark.session`=true by default to provide concurrent SQL
> > execution in context isolation (I guess).
> > This causes spark.newSession to be invoked for each transaction. Although the
> > embedded sessionCatalog is shared across all Spark sessions (including the new
> > one), in the DSv2 architecture the CatalogManager (which holds all plugin
> > catalogs) is created every time a SparkSession is constructed. In this case,
> > when concurrent queries fire, the more DSv2 catalogs we use, the more
> > overhead (mainly metaspace usage) the engine driver holds; in my test
> > with a 256m metaspace, OOM occurred.
> > Another issue is that the hive-catalog currently assumes all the
> > target hadoop clusters can be visited by the Spark execution runtime; that is
> > to say, a viewfs or router-based federation must be configured in advance,
> > and I am not sure if we can configure the hadoop conf separately for each hive
> > catalog.
> > Similarly, I just use the SQLConf in sessionState as the global sqlConf of all
> > hive catalogs; in some cases the default conf value may differ across
> > catalogs.
> >
> > [1] https://mp.weixin.qq.com/s/DJOIrRddCr7vYKEGGg0WyA
> >
> > On Aug 16, 2022 at 6:23 PM, Cheng Pan <pan3...@gmail.com> wrote:
> >
> > Thanks for your idea.
> >
> > It's up to the community whether Kyuubi will support this feature; if anyone
> > is interested in this feature, feel free to open a PR for it, I'm happy to
> > review.
> >
> > In fact, I found that one guy (also added as a receiver) has done (probably
> > part of) the job [1], but I didn't test it, and I would appreciate it if we
> > had a chance to collaborate.
> >
> > [1] https://github.com/permanentstar/spark-sql-dsv2-extension
> >
> > Thanks,
> > Cheng Pan
> >
> >
> > On Aug 16, 2022 at 17:57:35, zhaomin <zhaomin1...@163.com> wrote:
> >
> >> I'm also interested in it.
> >>
> >> Best Regards,
> >> Min Zhao
> >>
> >> ---- Replied Message ----
> >> | From    | kaifei yi <yikaif...@gmail.com>               |
> >> | Date    | 08/16/2022 17:42                              |
> >> | To      | dev@kyuubi.apache.org <dev@kyuubi.apache.org> |
> >> | Cc      |                                               |
> >> | Subject | Support Hive V2 DataSource in Kyuubi          |
> >>
> >> Hi, Kyuubi community:
> >>
> >> Currently, users are clamoring for the ability to run federated queries in
> >> a Lakehouse architecture, and we probably need several datasources to meet
> >> this.
> >>
> >> In practice, some user services need to access other Hive warehouses for
> >> federated queries. Currently, Apache Spark supports access to Hive data
> >> sources; however, in federated scenarios, some capabilities may be
> >> unavailable. For example, users may need to access different Hive warehouses
> >> in a single job to perform a federated query, and the Hive versions may be
> >> different; this requirement can be met by a Hive V2 datasource.
> >>
> >> Does the Kyuubi community have any idea how to include Hive V2 in the
> >> feature list?
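For context, here is a hedged sketch of what the federated query described above could look like once a Hive DSv2 catalog is available. The implementation class, option keys, and table names below are assumptions for illustration, not a confirmed API of the linked PR:

    import org.apache.spark.sql.SparkSession

    object FederatedHiveQuerySketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("hive-v2-federation-sketch")
          // Two DSv2 catalogs, each bound to a different Hive warehouse; the
          // implementation class name is a placeholder for the proposed connector.
          .config("spark.sql.catalog.hive_a", "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog")
          .config("spark.sql.catalog.hive_a.hive.metastore.uris", "thrift://metastore-a:9083")
          .config("spark.sql.catalog.hive_b", "org.apache.kyuubi.spark.connector.hive.HiveTableCatalog")
          .config("spark.sql.catalog.hive_b.hive.metastore.uris", "thrift://metastore-b:9083")
          .getOrCreate()

        // One job, two warehouses: the leading catalog name routes each table
        // to its own metastore instead of relying on a single global hive-site.
        spark.sql(
          """SELECT a.user_id, b.order_id
            |FROM hive_a.sales.users a
            |JOIN hive_b.orders.order_events b ON a.user_id = b.user_id
            |""".stripMargin).show()

        spark.stop()
      }
    }

The underlying Spark behavior this relies on is that all `spark.sql.catalog.<name>.*` entries are handed to that catalog's initialize(), so each Hive catalog could carry its own metastore URI (and potentially its own hadoop conf) without a global viewfs or router-based federation setup.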