Re: Support Hive V2 DataSource in Kyuubi

Heng Su Wed, 17 Aug 2022 01:02:33 -0700

Hi, Cheng Pan

Glad to join the session.

The git repo you pointed out is indeed used in our internal production ETL 
pipeline, though it is not currently combined with Kyuubi.

But I plan to refactor it in two respects:

1. With Spark 3.3 released, most DSv2 functionality seems to be production 
ready[1], and some APIs have changed since 3.1, so upgrading to this version 
may be more stable.
2. We also have a strong desire to integrate Kyuubi as our Spark SQL query 
engine, though that work is currently still at the research stage.
   I have found some issues in integrating the hive-catalog extension with 
Kyuubi. For instance, Kyuubi sets `kyuubi.engine.single.spark.session`=false 
by default to provide concurrent SQL execution with context isolation (I 
guess). This causes spark.newSession to be invoked for each transaction. 
Although the embedded sessionCatalog is shared across all Spark sessions 
(including the new ones), in the DSv2 architecture the catalogManager (which 
holds all plugin catalogs) is created every time a SparkSession is 
constructed. So when concurrent queries fire, the more DSv2 catalogs we use, 
the more overhead (mainly in metaspace usage) the engine driver holds; in my 
test with 256m of metaspace, an OOM occurred.
   Another issue is that the hive-catalog currently assumes that all the 
target Hadoop clusters can be reached from the Spark execution runtime; that 
is, a viewfs or router-based federation must be configured in advance. I am 
not sure whether we can configure the Hadoop conf separately for each hive 
catalog. Similarly, I just use the SQLConf in sessionState as the global 
SQLConf for all hive catalogs, but in some cases the default conf values may 
differ between catalogs.
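
To make the first issue concrete, here is a toy model in Python. It is not 
Spark or Kyuubi code, and every class and function name in it is a 
hypothetical stand-in. It only illustrates the counting argument: the session 
catalog is shared, but each new session gets its own catalog manager, so the 
number of plugin catalog instances grows as sessions x catalogs. It says 
nothing about actual metaspace sizes.

```python
# Toy model only: hypothetical stand-ins, not Spark or Kyuubi APIs.

class SharedSessionCatalog:
    """Stands in for the built-in session catalog shared by all sessions."""


class CatalogManager:
    """Stands in for the per-session holder of DSv2 plugin catalogs."""

    def __init__(self, plugin_names):
        # Built eagerly here for simplicity; a real engine may load
        # plugin catalogs on demand.
        self.plugins = {name: object() for name in plugin_names}


class Session:
    def __init__(self, shared, plugin_names):
        self.shared = shared
        self.plugin_names = list(plugin_names)
        # A fresh catalog manager for every session.
        self.catalog_manager = CatalogManager(self.plugin_names)

    def new_session(self):
        # The shared catalog is reused; the catalog manager is rebuilt.
        return Session(self.shared, self.plugin_names)


def count_plugin_instances(num_sessions, plugin_names):
    """Count plugin catalog instances held by sessions made via new_session."""
    root = Session(SharedSessionCatalog(), plugin_names)
    sessions = [root.new_session() for _ in range(num_sessions)]
    # Every session points at the same shared catalog object.
    assert all(s.shared is root.shared for s in sessions)
    return sum(len(s.catalog_manager.plugins) for s in sessions)
```

For example, count_plugin_instances(100, ["hive_a", "hive_b", "hive_c"]) gives 
300 plugin catalog instances for 100 concurrent sessions, against a single 
shared session catalog.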

[1] https://mp.weixin.qq.com/s/DJOIrRddCr7vYKEGGg0WyA

> On Aug 16, 2022, at 6:23 PM, Cheng Pan <pan3...@gmail.com> wrote:
> 
> Thanks for your idea.
> 
> It's up to the community whether Kyuubi will support this feature. If anyone 
> is interested in it, feel free to open a PR; I'm happy to review.
> 
> In fact, I found that someone (whom I have also added as a recipient) has 
> done (probably part of) the job [1]. I haven't tested it, but I would 
> appreciate it if we had a chance to collaborate.
> 
> [1] https://github.com/permanentstar/spark-sql-dsv2-extension
> 
> Thanks,
> Cheng Pan
> 
> 
> On Aug 16, 2022 at 17:57:35, zhaomin <zhaomin1...@163.com> wrote:
>> I'm also interested in it.
>> 
>> 
>> 
>> Best Regards,
>> Min Zhao
>> 
>> 
>> 
>> 
>> ---- Replied Message ----
>> | From | kaifei yi <yikaif...@gmail.com> |
>> | Date | 08/16/2022 17:42 |
>> | To | dev@kyuubi.apache.org |
>> | Cc | |
>> | Subject | Support Hive V2 DataSource in Kyuubi |
>> Hi, kyuubi community:
>> 
>> Currently, users are clamoring for the ability to run federated queries in
>> a Lakehouse architecture, and we probably need several datasources to meet
>> this.
>> 
>> In practice, some user services need to access other Hive warehouses for
>> federated queries. Currently, Apache Spark supports access to Hive data
>> sources. However, in federated scenarios some capabilities may be
>> unavailable. For example, users may need to access different Hive
>> warehouses in a single job to perform a federated query, and the Hive
>> versions may differ. This requirement can be met by a Hive V2 datasource.
>> 
>> Does the Kyuubi community have any idea how to include Hive V2 in the
>> feature list?
