Re: Support Hive V2 DataSource in Kyuubi

Cheng Pan Wed, 17 Aug 2022 01:07:27 -0700

Thanks for sharing your experience.

> kyuubi default set kyuubi.engine.single.spark.session=true


It’s `false` by default.

> … to provide concurrent SQL execution with context isolation (I guess)

The concept of a Spark session is similar to a JDBC/RDBMS connection.

> This causes spark.newSession to be invoked for each transaction. Although
> the embedded sessionCatalog is shared across all Spark sessions (including
> the new one), in the DSv2 architecture the catalogManager (which holds all
> plugin catalogs) is created every time a SparkSession is constructed. In
> this case, when concurrent queries fire, the more DSv2 catalogs we use, the
> more overhead (mainly in metaspace usage) the engine driver holds; in my
> test with 256m metaspace, an OOM occurs.

Yeah, this is the big difference between the v1 and v2 catalogs. Maybe we can
introduce a cache mechanism to reduce the overhead, e.g. a Hive client pool.
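
A minimal sketch of that caching idea follows, assuming the per-session cost is dominated by building one expensive client per catalog instance. Every name here (`FakeHiveClient`, `SharedClientCache`, `SessionCatalog`, the metastore URIs) is hypothetical and only illustrates the pattern; none of it is Spark, Hive, or Kyuubi API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for an expensive resource such as a Hive client. */
final class FakeHiveClient {
    /** Counts how many clients were actually constructed. */
    static final AtomicInteger CREATED = new AtomicInteger();
    final String metastoreUri;

    FakeHiveClient(String metastoreUri) {
        this.metastoreUri = metastoreUri;
        CREATED.incrementAndGet();
    }
}

/** Process-wide cache keyed by metastore URI, shared by every session. */
final class SharedClientCache {
    private static final Map<String, FakeHiveClient> CACHE = new ConcurrentHashMap<>();

    private SharedClientCache() {}

    /** Builds the client on first use; later callers get the same instance. */
    static FakeHiveClient getOrCreate(String metastoreUri) {
        return CACHE.computeIfAbsent(metastoreUri, FakeHiveClient::new);
    }
}

/** Hypothetical per-session catalog: cheap to build, borrows the shared client. */
final class SessionCatalog {
    final FakeHiveClient client;

    SessionCatalog(String metastoreUri) {
        this.client = SharedClientCache.getOrCreate(metastoreUri);
    }
}

class SharedClientCacheDemo {
    public static void main(String[] args) {
        // Two "sessions" pointing at the same metastore, one at a different one.
        SessionCatalog a1 = new SessionCatalog("thrift://ms-a:9083");
        SessionCatalog a2 = new SessionCatalog("thrift://ms-a:9083");
        SessionCatalog b1 = new SessionCatalog("thrift://ms-b:9083");
        System.out.println("a1 and a2 share a client: " + (a1.client == a2.client));
        System.out.println("a1 and b1 share a client: " + (a1.client == b1.client));
        System.out.println("clients created: " + FakeHiveClient.CREATED.get());
    }
}
```

In this sketch the per-session catalog object is still created for every session, but the heavy part is built once per distinct metastore URI. A real implementation would also need eviction and a key that covers every setting affecting the client.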

> … a viewfs or router-based federation must be configured in advance. I am
> not sure if we can configure the Hadoop conf separately for each Hive
> catalog.

Maybe we can learn something from Iceberg.
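
One way to read that suggestion: Iceberg's Spark integration documents per-catalog Hadoop overrides under a catalog-scoped property prefix. The sketch below shows the general idea of resolving such overrides on top of a shared base configuration. The key layout `spark.sql.catalog.<catalog>.hadoop.<key>`, the class `CatalogHadoopConf`, and the sample keys and values are assumptions for illustration, not an existing Kyuubi or Hive catalog API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class CatalogHadoopConf {
    private CatalogHadoopConf() {}

    /**
     * Returns the base Hadoop settings with the given catalog's overrides applied.
     * Assumed key layout: spark.sql.catalog.<catalog>.hadoop.<hadoop-key>
     */
    static Map<String, String> resolve(
            String catalog, Map<String, String> base, Map<String, String> sparkConf) {
        String prefix = "spark.sql.catalog." + catalog + ".hadoop.";
        Map<String, String> merged = new TreeMap<>(base);
        for (Map.Entry<String, String> e : sparkConf.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                // Strip the catalog prefix so only the plain Hadoop key remains.
                merged.put(e.getKey().substring(prefix.length()), e.getValue());
            }
        }
        return merged;
    }
}

class CatalogHadoopConfDemo {
    public static void main(String[] args) {
        Map<String, String> base = new HashMap<>();
        base.put("fs.defaultFS", "hdfs://cluster-default");

        Map<String, String> sparkConf = new HashMap<>();
        sparkConf.put("spark.sql.catalog.hive_a.hadoop.fs.defaultFS", "hdfs://cluster-a");
        sparkConf.put("spark.sql.catalog.hive_b.hadoop.fs.defaultFS", "hdfs://cluster-b");

        // Each catalog sees its own file system; unknown catalogs fall back to the base.
        System.out.println(CatalogHadoopConf.resolve("hive_a", base, sparkConf));
        System.out.println(CatalogHadoopConf.resolve("hive_b", base, sparkConf));
        System.out.println(CatalogHadoopConf.resolve("other", base, sparkConf));
    }
}
```

The same prefix approach could plausibly carry per-catalog SQL defaults as well, though whether that fits the Hive catalog implementation discussed here is an open question.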

Since you already have a Hive DSv2 catalog implementation in good shape,
and more and more people are interested in this feature, would
you like to contribute it to the Kyuubi project?

Thanks,
Cheng Pan


On Aug 17, 2022 at 11:23:07, Heng Su <[email protected]> wrote:

> Hi, Cheng Pan
>
> Glad to join the session.
>
> The git repo you pointed out is indeed used in our internal production ETL
> pipeline, though of course it is not currently combined with Kyuubi.
>
> But I plan to refactor it in two respects:
>
> 1. With Spark 3.3 released, most DSv2 functionality seems to be
> production ready[1], and some APIs have changed since 3.1, so upgrading
> to this version may be more stable.
> 2. We also strongly want to integrate Kyuubi as the Spark SQL query engine,
> though currently this work is only at the research stage.
>    I have found some issues in integrating the hive-catalog extension with
> Kyuubi. For instance, Kyuubi by default sets
> `kyuubi.engine.single.spark.session`=true to provide concurrent SQL
> execution with context isolation (I guess).
>    This causes spark.newSession to be invoked for each transaction. Although
> the embedded sessionCatalog is shared across all Spark sessions (including
> the new one), in the DSv2 architecture the catalogManager (which holds all
> plugin catalogs) is created every time a SparkSession is constructed. In
> this case, when concurrent queries fire, the more DSv2 catalogs we use, the
> more overhead (mainly in metaspace usage) the engine driver holds; in my
> test with 256m metaspace, an OOM occurs.
>    Another issue is that the hive-catalog currently assumes all the
> target Hadoop clusters can be reached by the Spark execution runtime; that
> is, a viewfs or router-based federation must be configured in advance. I am
> not sure if we can configure the Hadoop conf separately for each Hive
> catalog. Similarly, I just use the SQLConf in sessionState as the global
> SQLConf for all Hive catalogs; in some cases the default conf values may
> differ between catalogs.
>
>
> [1] https://mp.weixin.qq.com/s/DJOIrRddCr7vYKEGGg0WyA
>
> On Aug 16, 2022, at 6:23 PM, Cheng Pan <[email protected]> wrote:
>
> Thanks for your idea.
>
> It's up to the community whether Kyuubi will support this feature. If
> anyone is interested in it, feel free to open a PR; I'm happy to
> review.
>
> In fact, I found that someone (also added as a recipient) has done (probably
> part of) the job [1], but I didn't test it, and I would appreciate it if we
> had a chance to collaborate.
>
> [1] https://github.com/permanentstar/spark-sql-dsv2-extension
>
> Thanks,
> Cheng Pan
>
>
> On Aug 16, 2022 at 17:57:35, zhaomin <[email protected]> wrote:
>
>> I'm also interested in it.
>>
>>
>>
>> Best Regards,
>> Min Zhao
>>
>>
>>
>>
>> ---- Replied Message ----
>> | From | kaifei yi<[email protected]> |
>> | Date | 08/16/2022 17:42 |
>> | To | [email protected]<[email protected]> |
>> | Cc | |
>> | Subject | Support Hive V2 DataSource in Kyuubi |
>> Hi, kyuubi community:
>>
>> Currently, users are clamoring for the ability to run federated queries in
>> a Lakehouse architecture; we probably need several data sources to meet
>> this.
>>
>> In practice, some user services need to access other Hive warehouses for
>> federated queries. Currently, Apache Spark supports access to Hive data
>> sources. However, in federated scenarios some capabilities may not be
>> available. For example, users may need to access different Hive warehouses
>> in a single job to perform a federated query, and the Hive versions may
>> differ. This requirement can be met by a Hive V2 data source.
>>
>> Does the Kyuubi community have any idea how to include hive V2 in the
>> feature list?
>>
>
>
