I see you have started reviewing #3260; thank you for your participation.
Thanks,
Cheng Pan

On Aug 17, 2022 at 23:44:46, Heng Su <permanent.s...@gmail.com> wrote:

> Hi, Pan
>
> Sorry for the late reply; it has been another busy day.
> It's my pleasure to help the Kyuubi community in any way I can.
> To be fair, I haven't dug deeply into Kyuubi yet, as I am currently
> focusing on another project.
> Since #3260 has been opened, I think we can follow that route to deliver
> the functionality in a short time.
>
> Best regards,
> Heng Su
>
> On Aug 17, 2022 at 4:07 PM, Cheng Pan <pan3...@gmail.com> wrote:
>
> Thanks for sharing your experience.
>
>> kyuubi default set kyuubi.engine.single.spark.session=true
>
> It's `false` by default.
>
>> … provide concurrency sql execution in context isolation (I guess)
>
> The concept of a Spark session is similar to a JDBC/RDBMS connection.
>
>> This causes spark.newSession to be invoked for each transaction. Although
>> the embedded sessionCatalog is shared across all Spark sessions
>> (including the new ones), in the DSv2 architecture the catalogManager
>> (which holds all plugin catalogs) is created every time a SparkSession is
>> constructed. So when concurrent queries fire, the more DSv2 catalogs we
>> use, the more overhead (mainly metaspace usage) the engine driver holds;
>> in my test, with a 256m metaspace, an OOM occurred.
>
> Yeah, this is the big difference between the v1 and v2 catalogs; maybe we
> can introduce a cache mechanism to reduce the overhead, e.g. a hive
> client pool.
>
>> … a viewfs or router-based federation must be configured in advance; I am
>> not sure if we can configure the hadoop conf separately for each hive
>> catalog.
>
> Maybe we can learn something from Iceberg.
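The "hive client pool" idea mentioned above could be sketched roughly as follows. This is a minimal illustration in Java under assumed names (`CatalogClientPool` and `HiveClient` are hypothetical placeholders, not Kyuubi or Hive APIs): the point is to build one expensive client per catalog name and share it across Spark sessions, instead of re-creating it with every new CatalogManager.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a per-catalog client pool. Expensive clients are
// created at most once per catalog name and then shared, instead of
// being re-instantiated for every newly constructed SparkSession.
public class CatalogClientPool {

    // Stand-in for an expensive-to-create client (e.g. a metastore client).
    public static class HiveClient {
        final String catalogName;
        HiveClient(String catalogName) { this.catalogName = catalogName; }
    }

    private static final Map<String, HiveClient> POOL = new ConcurrentHashMap<>();

    // computeIfAbsent guarantees the client is created at most once per key,
    // even under concurrent access from many sessions.
    public static HiveClient getOrCreate(String catalogName) {
        return POOL.computeIfAbsent(catalogName, HiveClient::new);
    }

    public static void main(String[] args) {
        HiveClient a = getOrCreate("hive_a");
        HiveClient b = getOrCreate("hive_a");
        HiveClient c = getOrCreate("hive_b");
        System.out.println(a == b); // true: same catalog, same pooled instance
        System.out.println(a == c); // false: different catalog, different client
    }
}
```

A real pool would also need eviction and lifecycle handling (closing clients when a catalog is dropped), but even this shape would keep metaspace growth bounded by the number of distinct catalogs rather than sessions × catalogs.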
> Since you already have a good shape of a Hive DSv2 catalog
> implementation, and more and more people are interested in this feature,
> would you like to contribute it to the Kyuubi project?
>
> Thanks,
> Cheng Pan
>
> On Aug 17, 2022 at 11:23:07, Heng Su <permanent.s...@gmail.com> wrote:
>
>> Hi, Cheng Pan
>>
>> Glad to join the session.
>>
>> The git repo you pointed out is indeed used in our internal production
>> ETL pipeline, though currently not combined with Kyuubi.
>>
>> But I plan to refactor it in two aspects:
>>
>> 1. Since Spark 3.3 has been released, most DSv2 functionality seems to
>> be production ready [1], and some APIs have changed since 3.1, so
>> upgrading to this version should be more stable.
>> 2. We also have a strong desire to integrate Kyuubi as the Spark SQL
>> query engine, though currently this work is just in the research stage.
>> I have found some issues integrating the hive-catalog extension with
>> Kyuubi. For instance, Kyuubi by default sets
>> `kyuubi.engine.single.spark.session`=true to provide concurrent SQL
>> execution in context isolation (I guess).
>> This causes spark.newSession to be invoked for each transaction. Although
>> the embedded sessionCatalog is shared across all Spark sessions
>> (including the new ones), in the DSv2 architecture the catalogManager
>> (which holds all plugin catalogs) is created every time a SparkSession is
>> constructed. So when concurrent queries fire, the more DSv2 catalogs we
>> use, the more overhead (mainly metaspace usage) the engine driver holds;
>> in my test, with a 256m metaspace, an OOM occurred.
>> Another issue is that currently the hive-catalog assumes all the target
>> Hadoop clusters can be reached by the Spark execution runtime; that is, a
>> viewfs or router-based federation must be configured in advance. I am not
>> sure if we can configure the hadoop conf separately for each hive
>> catalog.
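For reference, the session-sharing behavior discussed in this exchange is governed by the Kyuubi setting below; as noted above, it actually defaults to `false`. A minimal `kyuubi-defaults.conf` fragment:

```properties
# kyuubi-defaults.conf
#
# false (the default): each connection gets its own SparkSession via
#   spark.newSession(), so each session builds its own CatalogManager
#   (and re-instantiates any configured DSv2 catalogs).
# true: all connections share a single SparkSession (and a single
#   CatalogManager), trading session isolation for lower driver overhead.
kyuubi.engine.single.spark.session=false
```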
>> Similarly, I just use the SQLConf in sessionState as the global sqlConf
>> for all hive catalogs; maybe in some cases the default conf values
>> should differ between catalogs.
>>
>> [1] https://mp.weixin.qq.com/s/DJOIrRddCr7vYKEGGg0WyA
>>
>> On Aug 16, 2022 at 6:23 PM, Cheng Pan <pan3...@gmail.com> wrote:
>>
>> Thanks for your idea.
>>
>> It's up to the community whether Kyuubi will support this feature; if
>> anyone is interested in it, feel free to open a PR for it, and I'm happy
>> to review.
>>
>> In fact, I found that one person (also added as a receiver here) has
>> done (probably part of) the job [1], but I didn't test it, and I would
>> appreciate it if we had a chance to collaborate.
>>
>> [1] https://github.com/permanentstar/spark-sql-dsv2-extension
>>
>> Thanks,
>> Cheng Pan
>>
>> On Aug 16, 2022 at 17:57:35, zhaomin <zhaomin1...@163.com> wrote:
>>
>>> I'm also interested in it.
>>>
>>> Best Regards,
>>> Min Zhao
>>>
>>> ---- Replied Message ----
>>> From: kaifei yi <yikaif...@gmail.com>
>>> Date: 08/16/2022 17:42
>>> To: dev@kyuubi.apache.org
>>> Subject: Support Hive V2 DataSource in Kyuubi
>>>
>>> Hi, Kyuubi community:
>>>
>>> Currently, users are asking for the ability to run federated queries in
>>> a Lakehouse architecture; we probably need several data sources to meet
>>> this need.
>>>
>>> In practice, some user services need to access other Hive warehouses
>>> for federated queries. Currently, Apache Spark supports access to Hive
>>> data sources; however, in federated scenarios, some capabilities are
>>> missing. For example, users may need to access different Hive
>>> warehouses within a single job to perform a federated query, where the
>>> Hive versions differ; this requirement could be met by a Hive V2 data
>>> source.
>>>
>>> Does the Kyuubi community have any idea how to include Hive V2 in the
>>> feature list?
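To make the federated-query requirement and the per-catalog configuration question above concrete: with DSv2, each catalog is registered under its own `spark.sql.catalog.<name>` prefix, which is also where per-catalog settings could live. The catalog names and the `org.example.hive.HiveCatalog` class below are hypothetical placeholders, not an existing Kyuubi or Spark API; Iceberg's Spark catalogs already accept `hadoop.`-prefixed per-catalog properties in this style, which may be the thing worth borrowing.

```properties
# spark-defaults.conf (sketch): two DSv2 Hive catalogs pointing at
# different warehouses; class name is a hypothetical placeholder.
spark.sql.catalog.hive_a=org.example.hive.HiveCatalog
spark.sql.catalog.hive_a.hive.metastore.uris=thrift://metastore-a:9083
spark.sql.catalog.hive_b=org.example.hive.HiveCatalog
spark.sql.catalog.hive_b.hive.metastore.uris=thrift://metastore-b:9083
# Iceberg-style per-catalog Hadoop conf, if the catalog supported it:
spark.sql.catalog.hive_a.hadoop.fs.defaultFS=hdfs://cluster-a
```

A single job could then join tables living in two different warehouses:

```sql
-- Federated query across two Hive warehouses in one job
SELECT a.user_id, b.order_id
FROM hive_a.db1.users a
JOIN hive_b.db2.orders b ON a.user_id = b.user_id;
```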