Re: Support Hive V2 DataSource in Kyuubi

kaifei yi Wed, 17 Aug 2022 04:31:55 -0700

Hi everyone,

There is now an issue [1] and an initial implementation PR [2] for this
discussion. Please go to GitHub for further discussion and review.


[1] https://github.com/apache/incubator-kyuubi/issues/3259
[2] https://github.com/apache/incubator-kyuubi/pull/3260

Thanks

On Wed, Aug 17, 2022 at 16:07, Cheng Pan <pan3...@gmail.com> wrote:

> Thanks for sharing your experience.
>
> > Kyuubi sets `kyuubi.engine.single.spark.session`=true by default
>
> It's `false` by default.
>
> > … to provide concurrent SQL execution with context isolation (I guess)
>
> The concept of a Spark session is similar to a JDBC/RDBMS connection.
>
> > This causes spark.newSession to be invoked for each transaction.
> > Although the embedded sessionCatalog is shared across all Spark sessions
> > (including the new one), in the DSv2 architecture the catalogManager
> > (which holds all plugin catalogs) is created every time a SparkSession is
> > constructed. In this case, when concurrent queries fire, the more DSv2
> > catalogs we use, the more overhead (mainly metaspace usage) the engine
> > driver holds. In my test with 256m of metaspace, OOM occurs.
>
> Yes, this is the big difference between the v1 catalog and v2. Maybe we can
> introduce a cache mechanism to reduce the overhead, such as a Hive client
> pool.
>
> > … a viewfs or router-based federation must be configured in advance. I am
> > not sure if we can configure the Hadoop conf separately for each Hive
> > catalog.
>
> Maybe we can learn something from Iceberg.
>
> Since your Hive DSv2 catalog implementation is already in good shape, and
> more and more people are interested in this feature, would you like to
> contribute it to the Kyuubi project?
>
> Thanks,
> Cheng Pan
>
>
> On Aug 17, 2022 at 11:23:07, Heng Su <permanent.s...@gmail.com> wrote:
>
> > Hi, Cheng Pan
> >
> > Glad to join the session.
> >
> > The git repo you pointed out is indeed used in our internal production
> > ETL pipeline, though it is not currently combined with Kyuubi.
> >
> > But I plan to refactor it in two respects:
> >
> > 1. With the release of Spark 3.3, most DSv2 functionality seems to be
> > production ready [1], and some APIs have changed since 3.1, so upgrading
> > to this version may be more stable.
> > 2. We also strongly want to integrate Kyuubi as our Spark SQL query
> > engine, although that work is currently only at the research stage.
> >    I have found some issues in integrating the hive-catalog extension
> > with Kyuubi. For instance, Kyuubi sets
> > `kyuubi.engine.single.spark.session`=true by default to provide
> > concurrent SQL execution with context isolation (I guess).
> >    This causes spark.newSession to be invoked for each transaction.
> > Although the embedded sessionCatalog is shared across all Spark sessions
> > (including the new one), in the DSv2 architecture the catalogManager
> > (which holds all plugin catalogs) is created every time a SparkSession is
> > constructed. In this case, when concurrent queries fire, the more DSv2
> > catalogs we use, the more overhead (mainly metaspace usage) the engine
> > driver holds. In my test with 256m of metaspace, OOM occurs.
> >    Another issue is that the hive-catalog currently assumes that all the
> > target Hadoop clusters can be reached from the Spark execution runtime;
> > that is, a viewfs or router-based federation must be configured in
> > advance. I am not sure if we can configure the Hadoop conf separately for
> > each Hive catalog.
> >    Similarly, I just use the SQLConf in sessionState as the global SQLConf
> > for all Hive catalogs; in some cases the default conf values may differ
> > between catalogs.
> >
> >
> > [1] https://mp.weixin.qq.com/s/DJOIrRddCr7vYKEGGg0WyA
> >
> > On Aug 16, 2022, at 6:23 PM, Cheng Pan <pan3...@gmail.com> wrote:
> >
> > Thanks for your idea.
> >
> > It's up to the community whether Kyuubi will support this feature. If
> > anyone is interested in it, feel free to open a PR; I'm happy to
> > review.
> >
> > In fact, I found that someone (whom I have also added as a recipient) has
> > already done the job, or probably part of it [1]. I haven't tested it,
> > but I would appreciate it if we had a chance to collaborate.
> >
> > [1] https://github.com/permanentstar/spark-sql-dsv2-extension
> >
> > Thanks,
> > Cheng Pan
> >
> >
> > On Aug 16, 2022 at 17:57:35, zhaomin <zhaomin1...@163.com> wrote:
> >
> >> I'm also interested in it.
> >>
> >>
> >>
> >> Best Regards,
> >> Min Zhao
> >>
> >>
> >>
> >>
> >> ---- Replied Message ----
> >> | From | kaifei yi<yikaif...@gmail.com> |
> >> | Date | 08/16/2022 17:42 |
> >> | To | dev@kyuubi.apache.org<dev@kyuubi.apache.org> |
> >> | Cc | |
> >> | Subject | Support Hive V2 DataSource in Kyuubi |
> >> Hi, Kyuubi community:
> >>
> >> Currently, users are clamoring for the ability to run federated queries
> >> in a Lakehouse architecture, and we probably need several data sources
> >> to meet this need.
> >>
> >> In practice, some user services need to access other Hive warehouses for
> >> federated queries. Apache Spark currently supports access to Hive data
> >> sources. However, in federated scenarios some capabilities may be
> >> unavailable. For example, users may need to access different Hive
> >> warehouses, running different Hive versions, within a single job to
> >> perform a federated query. This requirement can be met by a Hive V2
> >> data source.
> >>
> >> Does the Kyuubi community have any ideas on how to include Hive V2 in
> >> the feature list?
> >>
> >
> >
>
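
The cache mechanism suggested in the thread (sharing Hive clients across
sessions instead of building them once per session) can be sketched as
follows. This is only an illustration in plain Python, not Kyuubi or Spark
code: `SharedClientCache`, `FakeHiveClient`, `new_session`, the catalog names
and the metastore host names are all hypothetical, and a real implementation
would also need eviction and client lifecycle handling.

```python
import threading


class FakeHiveClient:
    """Hypothetical stand-in for an expensive-to-create Hive metastore client."""

    def __init__(self, catalog_name, conf):
        self.catalog_name = catalog_name
        self.conf = dict(conf)


class SharedClientCache:
    """Process-wide cache: one client per (catalog name, configuration)."""

    def __init__(self, factory):
        self._factory = factory
        self._clients = {}
        self._lock = threading.Lock()
        self.created = 0  # how many clients were actually constructed

    def get(self, catalog_name, conf):
        # Sort the conf items so that equal configurations map to the same key.
        key = (catalog_name, tuple(sorted(conf.items())))
        with self._lock:
            client = self._clients.get(key)
            if client is None:
                client = self._factory(catalog_name, conf)
                self._clients[key] = client
                self.created += 1
            return client


cache = SharedClientCache(FakeHiveClient)


def new_session(catalogs):
    """Assumed analogue of creating a new session: every session resolves its
    catalogs through the shared cache instead of constructing new clients."""
    return {name: cache.get(name, conf) for name, conf in catalogs.items()}


# Two hypothetical Hive catalogs pointing at different metastores.
catalogs = {
    "hive_a": {"hive.metastore.uris": "thrift://metastore-a:9083"},
    "hive_b": {"hive.metastore.uris": "thrift://metastore-b:9083"},
}

sessions = [new_session(catalogs) for _ in range(100)]
print(cache.created)  # 2: one client per catalog, regardless of session count
```

With this shape, the number of clients grows with the number of distinct
catalog configurations rather than with the number of concurrent sessions.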
