Hi, Pan

Sorry for the late reply, it’s been another busy day.
It’s my pleasure to help the Kyuubi community in any way I can.
To be fair, I haven’t dug deep into Kyuubi yet and am currently focused on
another project.
Since #3260 has been opened, I think we can follow that route to deliver the
functionality in a short time.


Best regards, Heng Su



> On Aug 17, 2022, at 4:07 PM, Cheng Pan <pan3...@gmail.com> wrote:
> 
> Thanks for sharing your experience.
> 
>> Kyuubi by default sets kyuubi.engine.single.spark.session=true
> 
> It’s `false` by default.
> 
>> … provide concurrent SQL execution in context isolation (I guess)
> 
> The concept of a Spark session is similar to a JDBC/RDBMS connection.
> 
>> This causes spark.newSession to be invoked for each transaction. Although the 
>> embedded sessionCatalog is shared across all Spark sessions (including the new 
>> ones), in the DSv2 architecture the catalogManager (which holds all plugin 
>> catalogs) is created every time a SparkSession is constructed. So when 
>> concurrent queries fire, the more DSv2 catalogs we use, the more overhead 
>> (mainly in metaspace usage) the engine driver holds; in my test with a 256m 
>> metaspace, an OOM occurred.
> 
> Yeah, this is the big difference between the v1 and v2 catalogs. Maybe we can 
> introduce a cache mechanism to reduce the overhead, e.g. a Hive client pool.
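> 
> A minimal sketch of what such a pool could look like, just to illustrate the 
> idea (the HiveClient trait and the newHiveClient factory are hypothetical 
> placeholders for whatever heavyweight per-catalog resource the plugin builds; 
> this is not an existing Kyuubi or Spark API). It is keyed by catalog name so 
> every SparkSession's catalog plugin instance reuses the same underlying client:
> 
> import java.util.concurrent.ConcurrentHashMap
> 
> // Hypothetical stand-in for a heavyweight per-catalog resource
> // (e.g. a Hive metastore client); not an existing Kyuubi/Spark class.
> trait HiveClient { def close(): Unit }
> 
> object HiveClientPool {
>   private val clients = new ConcurrentHashMap[String, HiveClient]()
> 
>   // Reuse one client per catalog name across all SparkSessions in the JVM,
>   // instead of building a new one each time a catalog plugin is instantiated.
>   def getOrCreate(catalogName: String)(newHiveClient: String => HiveClient): HiveClient =
>     clients.computeIfAbsent(catalogName, (name: String) => newHiveClient(name))
> }
> 
> With something like this, the per-session catalogManager would still create a 
> plugin object per session, but the expensive state behind it would be shared 
> process-wide.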
> 
>> … a viewfs or router-based federation must be configured in advance; I am 
>> not sure if we can configure the Hadoop conf separately for each Hive 
>> catalog.
> 
> Maybe we can learn something from Iceberg.
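> 
> For reference, a sketch of the Iceberg-style approach (the catalog names, 
> metastore URIs, and cluster addresses below are made up for illustration): as 
> I understand it, Iceberg forwards any `hadoop.`-prefixed catalog property into 
> that catalog's own Hadoop configuration, so each catalog can point at a 
> different cluster without a global viewfs mapping. A Hive DSv2 catalog could 
> adopt the same convention:
> 
> import org.apache.spark.sql.SparkSession
> 
> val spark = SparkSession.builder()
>   .appName("per-catalog-hadoop-conf-sketch")
>   // Two Iceberg Hive catalogs, each with its own metastore and defaultFS.
>   .config("spark.sql.catalog.hive_a", "org.apache.iceberg.spark.SparkCatalog")
>   .config("spark.sql.catalog.hive_a.type", "hive")
>   .config("spark.sql.catalog.hive_a.uri", "thrift://metastore-a:9083")
>   .config("spark.sql.catalog.hive_a.hadoop.fs.defaultFS", "hdfs://cluster-a")
>   .config("spark.sql.catalog.hive_b", "org.apache.iceberg.spark.SparkCatalog")
>   .config("spark.sql.catalog.hive_b.type", "hive")
>   .config("spark.sql.catalog.hive_b.uri", "thrift://metastore-b:9083")
>   .config("spark.sql.catalog.hive_b.hadoop.fs.defaultFS", "hdfs://cluster-b")
>   .getOrCreate()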
> 
> Since you already have a Hive DSv2 catalog implementation in good shape, and 
> more and more people are interested in this feature, would you like to 
> contribute it to the Kyuubi project?
> 
> Thanks,
> Cheng Pan
> 
> 
> On Aug 17, 2022 at 11:23:07, Heng Su <permanent.s...@gmail.com> wrote:
>> Hi, Cheng Pan
>> 
>> Glad to join the session.
>> 
>> The git repo you pointed out is indeed used in our internal production ETL 
>> pipeline, though we currently don't combine it with Kyuubi.
>> 
>> But I plan to refactor it in two aspects:
>> 
>> 1. Now that Spark 3.3 is released, most DSv2 functionality seems to be 
>> production ready [1], and some APIs have changed since 3.1, so upgrading to 
>> this version should be more stable.
>> 2. We also have a strong desire to integrate Kyuubi as the Spark SQL query 
>> engine, though that work is currently just in the research stage.
>>    I have found some issues integrating the hive-catalog extension with 
>> Kyuubi. For instance, Kyuubi by default sets 
>> `kyuubi.engine.single.spark.session`=true to provide concurrent SQL 
>> execution in context isolation (I guess).
>>    This causes spark.newSession to be invoked for each transaction. Although 
>> the embedded sessionCatalog is shared across all Spark sessions (including 
>> the new ones), in the DSv2 architecture the catalogManager (which holds all 
>> plugin catalogs) is created every time a SparkSession is constructed. So 
>> when concurrent queries fire, the more DSv2 catalogs we use, the more 
>> overhead (mainly in metaspace usage) the engine driver holds; in my test 
>> with a 256m metaspace, an OOM occurred.
>>    Another issue is that the hive-catalog currently assumes every target 
>> Hadoop cluster can be reached by the Spark execution runtime, that is, a 
>> viewfs or router-based federation must be configured in advance; I am not 
>> sure if we can configure the Hadoop conf separately for each Hive catalog. 
>> Similarly, I just use the SQLConf in sessionState as the global sqlConf for 
>> all Hive catalogs, but in some cases the default conf values may differ 
>> between catalogs (see the sketch right after this list).
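>> 
>> A minimal sketch of what per-catalog defaults could look like (the class 
>> name ConfScopedHiveCatalog and the defaultDatabase option key are made up; a 
>> real catalog would also have to extend TableCatalog / SupportsNamespaces): 
>> each DSv2 catalog receives only its own spark.sql.catalog.<name>.* options 
>> in initialize, so two Hive catalogs can carry different defaults without 
>> touching the shared session SQLConf.
>> 
>> import org.apache.spark.sql.connector.catalog.CatalogPlugin
>> import org.apache.spark.sql.util.CaseInsensitiveStringMap
>> 
>> // Sketch only: per-catalog defaults come from the catalog's own options
>> // (spark.sql.catalog.<name>.<key>) rather than the shared session SQLConf.
>> class ConfScopedHiveCatalog extends CatalogPlugin {
>>   private var catalogName: String = _
>>   private var defaultDb: String = _
>> 
>>   override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {
>>     catalogName = name
>>     // `defaultDatabase` is a made-up option key used for illustration.
>>     defaultDb = options.getOrDefault("defaultDatabase", "default")
>>   }
>> 
>>   override def name(): String = catalogName
>> 
>>   def defaultDatabase: String = defaultDb
>> }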
>> 
>> 
>> [1] https://mp.weixin.qq.com/s/DJOIrRddCr7vYKEGGg0WyA
>> 
>>> On Aug 16, 2022, at 6:23 PM, Cheng Pan <pan3...@gmail.com> wrote:
>>> 
>>> Thanks for your idea.
>>> 
>>> It's up to the community whether Kyuubi will support this feature. If 
>>> anyone is interested in it, feel free to open a PR; I'm happy to review.
>>> 
>>> In fact, I found that someone (also added as a receiver here) has done 
>>> (probably part of) the job [1], but I haven't tested it, and I would 
>>> appreciate it if we had a chance to collaborate.
>>> 
>>> [1] https://github.com/permanentstar/spark-sql-dsv2-extension
>>> 
>>> Thanks,
>>> Cheng Pan
>>> 
>>> 
>>> On Aug 16, 2022 at 17:57:35, zhaomin <zhaomin1...@163.com> wrote:
>>>> I'm also interested in it.
>>>> 
>>>> 
>>>> 
>>>> Best Regards,
>>>> Min Zhao
>>>> 
>>>> 
>>>> 
>>>> 
>>>> ---- Replied Message ----
>>>> From: kaifei yi <yikaif...@gmail.com>
>>>> Date: 08/16/2022 17:42
>>>> To: dev@kyuubi.apache.org
>>>> Subject: Support Hive V2 DataSource in Kyuubi
>>>> Hi, Kyuubi community:
>>>> 
>>>> Currently, users are clamoring for the ability to run federated queries in
>>>> a Lakehouse architecture, and we probably need several data sources to meet
>>>> this need.
>>>> 
>>>> In practice, some user services need to access other Hive warehouses for
>>>> federated queries. Currently, Apache Spark supports access to Hive data
>>>> sources; however, in federated scenarios some capabilities are limited. For
>>>> example, users may need to access different Hive warehouses within a single
>>>> job to perform a federated query, and the Hive versions may differ. This
>>>> requirement could be met by a Hive V2 datasource (see the sketch below).
>>>> 
>>>> Does the Kyuubi community have any ideas about including Hive V2 in the
>>>> feature list?
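>>>> 
>>>> To make the request concrete, here is a rough sketch of what it could look
>>>> like from the user side (the HiveV2Catalog class, catalog names, metastore
>>>> URIs, and table names are all hypothetical; nothing here is an existing
>>>> Kyuubi or Spark API):
>>>> 
>>>> import org.apache.spark.sql.SparkSession
>>>> 
>>>> // Register one DSv2 catalog per Hive warehouse, each with its own metastore.
>>>> val spark = SparkSession.builder()
>>>>   .appName("hive-v2-federation-sketch")
>>>>   .config("spark.sql.catalog.warehouse_a", "org.example.hive.HiveV2Catalog")
>>>>   .config("spark.sql.catalog.warehouse_a.hive.metastore.uris", "thrift://metastore-a:9083")
>>>>   .config("spark.sql.catalog.warehouse_b", "org.example.hive.HiveV2Catalog")
>>>>   .config("spark.sql.catalog.warehouse_b.hive.metastore.uris", "thrift://metastore-b:9083")
>>>>   .getOrCreate()
>>>> 
>>>> // One job can then join tables living in two different Hive warehouses,
>>>> // each catalog potentially backed by a different Hive/metastore version.
>>>> spark.sql(
>>>>   """
>>>>     |SELECT a.user_id, b.order_total
>>>>     |FROM warehouse_a.db1.users a
>>>>     |JOIN warehouse_b.db2.orders b ON a.user_id = b.user_id
>>>>     |""".stripMargin).show()
>>>> 
>>>> Each catalog would bundle its own metastore client (and potentially Hive
>>>> version), so the two warehouses never share state inside the job.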
>> 
