Thanks all for explaining. I misunderstood the original proposal.

-1 to putting them in our distributions, +1 to providing Hive uber jars as Seth and Aljoscha advise.
Hive is just a connector, no matter how important it is. So I totally agree that we shouldn't put it in our distributions. We can start offering three uber jars:

- flink-sql-connector-hive-1 (uber jar with Hive dependency version 1.2.1)
- flink-sql-connector-hive-2 (uber jar with Hive dependency version 2.3.4)
- flink-sql-connector-hive-3 (uber jar with Hive dependency version 3.1.1)

In my understanding, that is quite enough for users.

Best,
Jingsong Lee

On Sun, Dec 15, 2019 at 12:42 PM Jark Wu <imj...@gmail.com> wrote:

> I agree with Seth and Aljoscha and think that is the right way to go.
> We already provide uber jars for Kafka and Elasticsearch for an
> out-of-the-box experience; you can see the download links on this page [1].
> Users can easily download the connectors and versions they like and drop
> them into the SQL CLI lib directory. The uber jars contain all the required
> dependencies and may be shaded. This way, users can skip building an uber
> jar themselves.
> Hive is indeed a "connector" too, and should also follow this approach.
>
> Best,
> Jark
>
> [1]:
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies
>
> On Sat, 14 Dec 2019 at 03:03, Aljoscha Krettek <aljos...@apache.org>
> wrote:
>
> > I was going to suggest the same thing as Seth. So yes, I'm against having
> > Flink distributions that contain Hive, but I am for convenience downloads
> > as we have for Hadoop.
> >
> > Best,
> > Aljoscha
> >
> > > On 13. Dec 2019, at 18:04, Seth Wiesman <sjwies...@gmail.com> wrote:
> > >
> > > I'm also -1 on separate builds.
> > >
> > > What about publishing convenience jars that contain the dependencies
> > > for each version? For example, there could be a flink-hive-1.2.1-uber.jar
> > > that users could just add to their lib folder that contains all the
> > > necessary dependencies to connect to that Hive version.
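(For illustration, here is roughly what the user side of this would look like after dropping the matching uber jar into the SQL CLI lib directory: registering a HiveCatalog in the SQL client configuration. This is a sketch based on the Hive catalog options in the current docs; the catalog name and paths below are made up.)

```yaml
# sql-client-defaults.yaml -- register a HiveCatalog once the matching
# uber jar (e.g. a flink-sql-connector-hive-2 jar) is in the lib directory.
# "myhive" and /opt/hive-conf are placeholders for the user's own setup.
catalogs:
  - name: myhive
    type: hive
    hive-conf-dir: /opt/hive-conf   # directory containing hive-site.xml
    hive-version: 2.3.4
```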
> > >
> > >
> > > On Fri, Dec 13, 2019 at 8:50 AM Robert Metzger <rmetz...@apache.org>
> > > wrote:
> > >
> > >> I'm generally not opposed to convenience binaries if a huge number of
> > >> people would benefit from them and the overhead for the Flink project
> > >> is low. I have not seen a huge demand for such binaries yet (neither
> > >> for the Flink + Hive integration). Looking at Apache Spark, they also
> > >> only offer convenience binaries for Hadoop.
> > >>
> > >> Maybe we could provide a "Docker Playground" for Flink + Hive in the
> > >> documentation (and the flink-playgrounds.git repo)?
> > >> (similar to
> > >> https://ci.apache.org/projects/flink/flink-docs-master/getting-started/docker-playgrounds/flink-operations-playground.html
> > >> )
> > >>
> > >>
> > >> On Fri, Dec 13, 2019 at 3:04 PM Chesnay Schepler <ches...@apache.org>
> > >> wrote:
> > >>
> > >>> -1
> > >>>
> > >>> We shouldn't need to deploy additional binaries to make a feature
> > >>> remotely usable.
> > >>> This usually points to something else being done incorrectly.
> > >>>
> > >>> If it is indeed such a hassle to set up Hive on Flink, then my
> > >>> conclusion would be that either
> > >>> a) the documentation needs to be improved,
> > >>> b) the architecture needs to be improved,
> > >>> or, if all else fails, c) we provide a utility script to make setup
> > >>> easier.
> > >>>
> > >>> We spent a lot of time on reducing the number of binaries in the
> > >>> Hadoop days, and also went to extra lengths to avoid a separate Java 11
> > >>> binary, and I see no reason why Hive should get special treatment in
> > >>> this matter.
> > >>>
> > >>> Regards,
> > >>> Chesnay
> > >>>
> > >>> On 13/12/2019 09:44, Bowen Li wrote:
> > >>>> Hi all,
> > >>>>
> > >>>> I want to propose to have a couple of separate Flink distributions
> > >>>> with Hive dependencies on specific Hive versions (2.3.4 and 1.2.1).
> > >>>> The distributions will be provided to users on the Flink download
> > >>>> page [1].
> > >>>>
> > >>>> A few reasons to do this:
> > >>>>
> > >>>> 1) The Flink-Hive integration is important to many Flink and Hive
> > >>>> users in two dimensions:
> > >>>>     a) for Flink metadata: HiveCatalog is the only persistent catalog
> > >>>> for managing Flink tables. With Flink 1.10 supporting more DDL, the
> > >>>> persistent catalog will play an even more critical role in users'
> > >>>> workflows.
> > >>>>     b) for Flink data: the Hive data connector (source/sink) helps
> > >>>> both Flink and Hive users unlock new use cases in streaming,
> > >>>> near-realtime/realtime data warehousing, backfill, etc.
> > >>>>
> > >>>> 2) Currently users have to go through a *really* tedious process to
> > >>>> get started, because it requires lots of extra jars (see [2]) that are
> > >>>> absent from Flink's lean distribution. We've had many users from the
> > >>>> public mailing list, private email, and DingTalk groups who got
> > >>>> frustrated spending lots of time figuring out the jars themselves.
> > >>>> They would rather have a more "out of the box" quickstart experience
> > >>>> and play with the catalog and source/sink without hassle.
> > >>>>
> > >>>> 3) It's easier for users to swap in those Hive dependencies for their
> > >>>> own Hive versions - just replace the jars with the right versions,
> > >>>> with no need to consult the docs.
> > >>>>
> > >>>> * Hive 2.3.4 and 1.2.1 are two versions that represent a large user
> > >>>> base out there, and that's why we use them as examples for the
> > >>>> dependencies in [1], even though we now support almost all Hive
> > >>>> versions [3].
> > >>>>
> > >>>> I want to hear what the community thinks about this, and how to
> > >>>> achieve it if we believe that's the way to go.
> > >>>>
> > >>>> Cheers,
> > >>>> Bowen
> > >>>>
> > >>>> [1] https://flink.apache.org/downloads.html
> > >>>> [2]
> > >>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
> > >>>> [3]
> > >>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#supported-hive-versions
> > >>>>
> > >>>
> > >>
> >

--
Best,
Jingsong Lee