I see you have started reviewing #3260; thank you for your participation.
Thanks,
Cheng Pan

On Aug 17, 2022 at 23:44:46, Heng Su <permanent.s...@gmail.com> wrote:

> Hi, Pan
>
> Sorry for the late reply; it has been another busy day.
> It's my pleasure to help the Kyuubi community in any way I can.
> To be fair, I haven't dug deeply into Kyuubi yet, as I am currently
> focusing on another project.
> Since #3260 has been opened, I think we can follow that route to deliver
> the functionality in a short time.
>
> Best regards,
> Heng Su
>
> On Aug 17, 2022 at 4:07 PM, Cheng Pan <pan3...@gmail.com> wrote:
>
> Thanks for sharing your experience.
>
>> kyuubi default set kyuubi.engine.single.spark.session=true
>
> It's `false` by default.
>
>> … provide concurrency sql execution in context isolation (I guess)
>
> The concept of a Spark session is similar to a JDBC/RDBMS connection.
>
>> This causes spark.newSession to be invoked for each transaction. Although
>> the embedded sessionCatalog is shared across all Spark sessions
>> (including the new ones), in the DSv2 architecture the catalogManager
>> (which holds all plugin catalogs) is created every time a SparkSession is
>> constructed. So when concurrent queries fire, the more DSv2 catalogs we
>> use, the more overhead (mainly metaspace usage) the engine driver holds;
>> in my test, with a 256m metaspace, an OOM occurred.
>
> Yeah, this is the big difference between the v1 and v2 catalogs; maybe we
> can introduce a cache mechanism to reduce the overhead, e.g. a hive
> client pool.
>
>> … a viewfs or router-based federation must be configured in advance; I am
>> not sure if we can configure the hadoop conf separately for each hive
>> catalog.
>
> Maybe we can learn something from Iceberg.
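The "hive client pool" idea mentioned above could be sketched roughly as follows. This is a minimal illustration in Java under assumed names (`CatalogClientPool` and `HiveClient` are hypothetical placeholders, not Kyuubi or Hive APIs): the point is to build one expensive client per catalog name and share it across Spark sessions, instead of re-creating it with every new CatalogManager.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal sketch of a per-catalog client pool. Expensive clients are
// created at most once per catalog name and then shared, instead of
// being re-instantiated for every newly constructed SparkSession.
public class CatalogClientPool {

    // Stand-in for an expensive-to-create client (e.g. a metastore client).
    public static class HiveClient {
        final String catalogName;
        HiveClient(String catalogName) { this.catalogName = catalogName; }
    }

    private static final Map<String, HiveClient> POOL = new ConcurrentHashMap<>();

    // computeIfAbsent guarantees the client is created at most once per key,
    // even under concurrent access from many sessions.
    public static HiveClient getOrCreate(String catalogName) {
        return POOL.computeIfAbsent(catalogName, HiveClient::new);
    }

    public static void main(String[] args) {
        HiveClient a = getOrCreate("hive_a");
        HiveClient b = getOrCreate("hive_a");
        HiveClient c = getOrCreate("hive_b");
        System.out.println(a == b); // true: same catalog, same pooled instance
        System.out.println(a == c); // false: different catalog, different client
    }
}
```

A real pool would also need eviction and lifecycle handling (closing clients when a catalog is dropped), but even this shape would keep metaspace growth bounded by the number of distinct catalogs rather than sessions × catalogs.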
> Since you already have a good shape of a Hive DSv2 catalog
> implementation, and more and more people are interested in this feature,
> would you like to contribute it to the Kyuubi project?
>
> Thanks,
> Cheng Pan
>
> On Aug 17, 2022 at 11:23:07, Heng Su <permanent.s...@gmail.com> wrote:
>
>> Hi, Cheng Pan
>>
>> Glad to join the session.
>>
>> The git repo you pointed out is indeed used in our internal production
>> ETL pipeline, though currently not combined with Kyuubi.
>>
>> But I plan to refactor it in two aspects:
>>
>> 1. Since Spark 3.3 has been released, most DSv2 functionality seems to
>> be production ready [1], and some APIs have changed since 3.1, so
>> upgrading to this version should be more stable.
>> 2. We also have a strong desire to integrate Kyuubi as the Spark SQL
>> query engine, though currently this work is just in the research stage.
>> I have found some issues integrating the hive-catalog extension with
>> Kyuubi. For instance, Kyuubi by default sets
>> `kyuubi.engine.single.spark.session`=true to provide concurrent SQL
>> execution in context isolation (I guess).
>> This causes spark.newSession to be invoked for each transaction. Although
>> the embedded sessionCatalog is shared across all Spark sessions
>> (including the new ones), in the DSv2 architecture the catalogManager
>> (which holds all plugin catalogs) is created every time a SparkSession is
>> constructed. So when concurrent queries fire, the more DSv2 catalogs we
>> use, the more overhead (mainly metaspace usage) the engine driver holds;
>> in my test, with a 256m metaspace, an OOM occurred.
>> Another issue is that currently the hive-catalog assumes all the target
>> Hadoop clusters can be reached by the Spark execution runtime; that is, a
>> viewfs or router-based federation must be configured in advance. I am not
>> sure if we can configure the hadoop conf separately for each hive
>> catalog.
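For reference, the session-sharing behavior discussed in this exchange is governed by the Kyuubi setting below; as noted above, it actually defaults to `false`. A minimal `kyuubi-defaults.conf` fragment:

```properties
# kyuubi-defaults.conf
#
# false (the default): each connection gets its own SparkSession via
#   spark.newSession(), so each session builds its own CatalogManager
#   (and re-instantiates any configured DSv2 catalogs).
# true: all connections share a single SparkSession (and a single
#   CatalogManager), trading session isolation for lower driver overhead.
kyuubi.engine.single.spark.session=false
```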
>> Similarly, I just use the SQLConf in sessionState as the global sqlConf
>> for all hive catalogs; maybe in some cases the default conf values
>> should differ between catalogs.
>>
>> [1] https://mp.weixin.qq.com/s/DJOIrRddCr7vYKEGGg0WyA
>>
>> On Aug 16, 2022 at 6:23 PM, Cheng Pan <pan3...@gmail.com> wrote:
>>
>> Thanks for your idea.
>>
>> It's up to the community whether Kyuubi will support this feature; if
>> anyone is interested in it, feel free to open a PR for it, and I'm happy
>> to review.
>>
>> In fact, I found that one person (also added as a receiver here) has
>> done (probably part of) the job [1], but I didn't test it, and I would
>> appreciate it if we had a chance to collaborate.
>>
>> [1] https://github.com/permanentstar/spark-sql-dsv2-extension
>>
>> Thanks,
>> Cheng Pan
>>
>> On Aug 16, 2022 at 17:57:35, zhaomin <zhaomin1...@163.com> wrote:
>>
>>> I'm also interested in it.
>>>
>>> Best Regards,
>>> Min Zhao
>>>
>>> ---- Replied Message ----
>>> From: kaifei yi <yikaif...@gmail.com>
>>> Date: 08/16/2022 17:42
>>> To: dev@kyuubi.apache.org
>>> Subject: Support Hive V2 DataSource in Kyuubi
>>>
>>> Hi, Kyuubi community:
>>>
>>> Currently, users are asking for the ability to run federated queries in
>>> a Lakehouse architecture; we probably need several data sources to meet
>>> this need.
>>>
>>> In practice, some user services need to access other Hive warehouses
>>> for federated queries. Currently, Apache Spark supports access to Hive
>>> data sources; however, in federated scenarios, some capabilities are
>>> missing. For example, users may need to access different Hive
>>> warehouses within a single job to perform a federated query, where the
>>> Hive versions differ; this requirement could be met by a Hive V2 data
>>> source.
>>>
>>> Does the Kyuubi community have any idea how to include Hive V2 in the
>>> feature list?
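To make the federated-query requirement and the per-catalog configuration question above concrete: with DSv2, each catalog is registered under its own `spark.sql.catalog.<name>` prefix, which is also where per-catalog settings could live. The catalog names and the `org.example.hive.HiveCatalog` class below are hypothetical placeholders, not an existing Kyuubi or Spark API; Iceberg's Spark catalogs already accept `hadoop.`-prefixed per-catalog properties in this style, which may be the thing worth borrowing.

```properties
# spark-defaults.conf (sketch): two DSv2 Hive catalogs pointing at
# different warehouses; class name is a hypothetical placeholder.
spark.sql.catalog.hive_a=org.example.hive.HiveCatalog
spark.sql.catalog.hive_a.hive.metastore.uris=thrift://metastore-a:9083
spark.sql.catalog.hive_b=org.example.hive.HiveCatalog
spark.sql.catalog.hive_b.hive.metastore.uris=thrift://metastore-b:9083
# Iceberg-style per-catalog Hadoop conf, if the catalog supported it:
spark.sql.catalog.hive_a.hadoop.fs.defaultFS=hdfs://cluster-a
```

A single job could then join tables living in two different warehouses:

```sql
-- Federated query across two Hive warehouses in one job
SELECT a.user_id, b.order_id
FROM hive_a.db1.users a
JOIN hive_b.db2.orders b ON a.user_id = b.user_id;
```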