Hi Pan,

Sorry for the late reply; it has been another busy day. I'm glad to help the Kyuubi community in any way I can. To be fair, I haven't dug deeply into Kyuubi yet and am currently focused on another project. Since #3260 has been opened, I think we can follow that approach to deliver the functionality quickly.

Best regards,
Heng Su

> On Aug 17, 2022, at 4:07 PM, Cheng Pan <pan3...@gmail.com> wrote:
>
> Thanks for sharing your experience.
>
>> Kyuubi sets kyuubi.engine.single.spark.session=true by default
>
> It's `false` by default.
>
>> … provide concurrent SQL execution with context isolation (I guess)
>
> The concept of a Spark session is similar to a JDBC/RDBMS connection.
>
>> This causes spark.newSession to be invoked for each transaction. Although
>> the embedded sessionCatalog is shared across all Spark sessions (including
>> the new one), in the DSv2 architecture the catalogManager (which holds all
>> plugin catalogs) is created every time a SparkSession is constructed. So
>> when concurrent queries fire, the more DSv2 catalogs we use, the more
>> overhead (mainly metaspace usage) the engine driver carries; in my test
>> with 256m of metaspace, an OOM occurs.
>
> Yes, this is the big difference between the v1 catalog and v2. Maybe we can
> introduce a cache mechanism to reduce the overhead, e.g. a Hive client pool.
>
>> … a viewfs or router-based federation must be configured in advance. I am
>> not sure if we can configure the Hadoop conf separately for each Hive
>> catalog.
>
> Maybe we can learn something from Iceberg.
>
> Since you already have a Hive DSv2 catalog implementation in good shape, and
> more and more people are interested in this feature, would you like to
> contribute it to the Kyuubi project?
>
> Thanks,
> Cheng Pan
>
>
> On Aug 17, 2022 at 11:23:07, Heng Su <permanent.s...@gmail.com> wrote:
>> Hi, Cheng Pan
>>
>> Glad to join the discussion.
>>
>> The git repo you pointed out is indeed used in our internal production ETL
>> pipeline, though it is not currently combined with Kyuubi.
>>
>> But I plan to refactor it in two respects:
>>
>> 1. Now that Spark 3.3 has been released, most DSv2 functionality seems to
>> be production ready [1], and some APIs have changed since 3.1, so upgrading
>> to this version may be more stable.
>> 2. We also strongly want to integrate Kyuubi as our Spark SQL query engine,
>> although that work is currently only at the research stage.
>>
>> I have found some issues in integrating the hive-catalog extension with
>> Kyuubi. For instance, Kyuubi sets
>> `kyuubi.engine.single.spark.session`=true by default to provide concurrent
>> SQL execution with context isolation (I guess).
>> This causes spark.newSession to be invoked for each transaction. Although
>> the embedded sessionCatalog is shared across all Spark sessions (including
>> the new one), in the DSv2 architecture the catalogManager (which holds all
>> plugin catalogs) is created every time a SparkSession is constructed. So
>> when concurrent queries fire, the more DSv2 catalogs we use, the more
>> overhead (mainly metaspace usage) the engine driver carries; in my test
>> with 256m of metaspace, an OOM occurs.
>> Another issue is that the hive-catalog currently assumes that all the
>> target Hadoop clusters can be reached by the Spark execution runtime; that
>> is, a viewfs or router-based federation must be configured in advance. I am
>> not sure if we can configure the Hadoop conf separately for each Hive
>> catalog. Similarly, I just use the SQLConf in sessionState as the global
>> SQLConf for all Hive catalogs; in some cases the default conf values may
>> differ between catalogs.
>>
>>
>> [1] https://mp.weixin.qq.com/s/DJOIrRddCr7vYKEGGg0WyA
>>
>>> On Aug 16, 2022, at 6:23 PM, Cheng Pan <pan3...@gmail.com> wrote:
>>>
>>> Thanks for your idea.
>>>
>>> It's up to the community whether Kyuubi will support this feature. If
>>> anyone is interested in this feature, feel free to open a PR for it; I'm
>>> happy to review.
>>>
>>> In fact, I found that one person (also adding him as a recipient) has done
>>> (probably part of) the job [1], but I haven't tested it, and I would
>>> appreciate it if we had a chance to collaborate.
>>>
>>> [1] https://github.com/permanentstar/spark-sql-dsv2-extension
>>>
>>> Thanks,
>>> Cheng Pan
>>>
>>>
>>> On Aug 16, 2022 at 17:57:35, zhaomin <zhaomin1...@163.com> wrote:
>>>> I'm also interested in it.
>>>>
>>>> Best Regards,
>>>> Min Zhao
>>>>
>>>> ---- Replied Message ----
>>>> | From | kaifei yi <yikaif...@gmail.com> |
>>>> | Date | 08/16/2022 17:42 |
>>>> | To | dev@kyuubi.apache.org <dev@kyuubi.apache.org> |
>>>> | Cc | |
>>>> | Subject | Support Hive V2 DataSource in Kyuubi |
>>>> Hi, kyuubi community:
>>>>
>>>> Currently, users are clamoring for the ability to run federated queries
>>>> in a Lakehouse architecture, and we probably need several data sources to
>>>> meet this.
>>>>
>>>> In practice, some user services need to access other Hive warehouses for
>>>> federated queries. Currently, Apache Spark supports access to Hive data
>>>> sources. However, in federated scenarios some capabilities may be
>>>> unavailable. For example, users may need to access different Hive
>>>> warehouses in a single job to perform a federated query, and the Hive
>>>> versions may differ. This requirement can be met by a Hive V2 data source.
>>>>
>>>> Does the Kyuubi community have any thoughts on including Hive V2 in the
>>>> feature list?
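
Note on the cache idea raised in this thread: the concern is that every new session builds its own set of plugin catalogs, so per-catalog resources multiply with the number of concurrent sessions. The following is only a minimal, language-neutral sketch in Python of a process-wide client cache keyed by catalog name and configuration. It is not Kyuubi or Spark code; `CatalogClient`, `SharedClientCache`, `Session`, and the metastore host names are all hypothetical names introduced for illustration.

```python
import threading


class CatalogClient:
    """Hypothetical stand-in for an expensive per-catalog client
    (for example, a Hive metastore client)."""

    instances_created = 0  # counts how many clients were actually built

    def __init__(self, catalog_name, conf):
        CatalogClient.instances_created += 1
        self.catalog_name = catalog_name
        self.conf = dict(conf)


class SharedClientCache:
    """Process-wide cache keyed by catalog name plus its configuration."""

    def __init__(self):
        self._lock = threading.Lock()
        self._clients = {}

    def get(self, catalog_name, conf):
        # Include the conf in the key so catalogs with different settings
        # never share a client by accident.
        key = (catalog_name, tuple(sorted(conf.items())))
        with self._lock:
            client = self._clients.get(key)
            if client is None:
                client = CatalogClient(catalog_name, conf)
                self._clients[key] = client
            return client


class Session:
    """Hypothetical session: it owns its own catalog map (as each new
    session does in the thread), but looks clients up in the shared cache."""

    def __init__(self, cache, catalogs):
        self.catalogs = {
            name: cache.get(name, conf) for name, conf in catalogs.items()
        }


# Two catalogs pointing at different (made-up) metastores.
CATALOGS = {
    "hive_a": {"hive.metastore.uris": "thrift://metastore-a:9083"},
    "hive_b": {"hive.metastore.uris": "thrift://metastore-b:9083"},
}

CACHE = SharedClientCache()
sessions = [Session(CACHE, CATALOGS) for _ in range(10)]
```

With this shape, ten sessions over two catalogs construct two clients rather than twenty. Whether the same approach addresses the metaspace growth reported above depends on what the real catalogs load per instance, which this sketch does not model.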