I want to reinforce my opinion from earlier: this is about improving the situation both for first-time users and for experienced users who want to use a Flink dist in production. The current Flink dist is too "thin" for first-time SQL users and too "fat" for production users, which is why the current middle ground serves no one properly. That's why I think introducing those specialized "spins" of Flink dist would be good.

By the way, at some point in the future production users might not even need to get a Flink dist anymore. They should be able to have Flink as a dependency of their project (including the runtime) and then build an image from this for Kubernetes or a fat jar for YARN.

Aljoscha

On 15.04.20 18:14, wenlong.lwl wrote:
Hi all,

Regarding slim and fat distributions, I think different kinds of jobs may prefer different types of distribution:

For DataStream jobs, I think we may not want a fat distribution containing connectors, because users always need to depend on the connector in their user code anyway, and it is easy to include the connector jar in the user lib. Fewer jars in lib means fewer class conflicts and problems.
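
To make that concrete, here is a minimal sketch of such a DataStream job (not part of the original mail; the topic, group id and broker address are placeholders, and the consumer API shown is the flink-connector-kafka one from the Flink 1.10 era). The connector classes come from a dependency bundled into the user jar, not from the distribution's lib directory:

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaDataStreamJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The connector classes below come from the flink-connector-kafka dependency
        // that is shaded into the user jar, not from the distribution's lib directory.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder address
        props.setProperty("group.id", "demo");                    // placeholder group id

        env.addSource(new FlinkKafkaConsumer<>("input-topic", new SimpleStringSchema(), props))
           .print();

        env.execute("DataStream job with the connector as a user-code dependency");
    }
}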

For SQL jobs, I think we are trying to encourage users to use pure SQL (DDL + DML) to construct their jobs. To improve the user experience, it may be important for Flink not only to provide as many connector jars in the distribution as possible (especially the connectors and formats we have documented well), but also to provide a mechanism to load connectors according to the DDLs.

So I think it could be good to place connector/format jars in a directory like opt/connector, which would not affect jobs by default, and to introduce a dynamic discovery mechanism for SQL.
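
As a rough illustration only (this is not an existing Flink mechanism; the directory layout, regex and class are hypothetical), such a discovery step could parse the 'connector' option out of the DDL and pick the matching jar from opt/connector before submitting the job:

import java.io.File;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConnectorJarDiscovery {

    // Matches 'connector' = 'kafka' as well as the older 'connector.type' = 'kafka' style.
    private static final Pattern CONNECTOR_OPTION =
            Pattern.compile("'connector(?:\\.type)?'\\s*=\\s*'([^']+)'");

    /** Extracts the connector identifier (e.g. "kafka") from a CREATE TABLE DDL. */
    static Optional<String> connectorFromDdl(String ddl) {
        Matcher m = CONNECTOR_OPTION.matcher(ddl);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    /** Finds a jar under opt/connector whose file name contains the connector identifier. */
    static Optional<File> findConnectorJar(File connectorDir, String connector) {
        File[] jars = connectorDir.listFiles(
                (dir, name) -> name.contains(connector) && name.endsWith(".jar"));
        return (jars == null || jars.length == 0) ? Optional.empty() : Optional.of(jars[0]);
    }

    public static void main(String[] args) {
        String ddl = "CREATE TABLE src (msg STRING) WITH ('connector' = 'kafka', 'topic' = 'demo')";
        Optional<File> jar = connectorFromDdl(ddl)
                .flatMap(c -> findConnectorJar(new File("opt/connector"), c));
        // A real implementation would add the jar to the job's classpath
        // (e.g. ship it with the job) instead of just printing it.
        System.out.println("Would ship: " + jar);
    }
}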

Best,
Wenlong

On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsongl...@gmail.com> wrote:

Hi,

I am thinking about both "improving the first experience" and "improving the production experience".

I'm thinking about what the most common ways of using Flink are: streaming jobs with Kafka? Batch jobs with Hive?

The Hive 1.2.1 dependencies are compatible with most Hive server versions, which is why Spark and Presto ship a built-in Hive 1.2.1 dependency. Flink is currently mainly used for streaming, though, so let's not talk about Hive here.

For streaming jobs, the jobs I have in mind are (in terms of connectors):
- ETL jobs: Kafka -> Kafka
- Join jobs: Kafka -> JDBC dimension table -> Kafka
- Aggregation jobs: Kafka -> JDBC sink
So Kafka and JDBC are probably the most commonly used, along with the CSV and JSON formats.
So we could provide a fat distribution that contains:
- CSV and JSON.
- flink-kafka-universal and its Kafka dependencies.
- flink-jdbc.
With this fat distribution, most users can run their jobs out of the box. (A JDBC driver jar is still required, but adding it is very natural.)
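
As an illustration (a sketch only; the table schemas, topics, URLs and WITH option keys are made up, and the exact option names differ between Flink versions), such an aggregation job, Kafka -> JDBC sink, then only needs the Kafka connector, the JSON format, the JDBC connector and a JDBC driver on the classpath:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class KafkaToJdbcAggregation {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Kafka source table, JSON format: needs the Kafka connector and JSON format jars.
        tEnv.executeSql(
                "CREATE TABLE orders (" +
                "  user_id BIGINT," +
                "  amount DOUBLE" +
                ") WITH (" +
                "  'connector' = 'kafka'," +
                "  'topic' = 'orders'," +
                "  'properties.bootstrap.servers' = 'localhost:9092'," +
                "  'properties.group.id' = 'demo'," +
                "  'format' = 'json'" +
                ")");

        // JDBC sink table: needs the JDBC connector jar plus the database's JDBC driver.
        tEnv.executeSql(
                "CREATE TABLE user_totals (" +
                "  user_id BIGINT," +
                "  total DOUBLE," +
                "  PRIMARY KEY (user_id) NOT ENFORCED" +
                ") WITH (" +
                "  'connector' = 'jdbc'," +
                "  'url' = 'jdbc:mysql://localhost:3306/demo'," +
                "  'table-name' = 'user_totals'" +
                ")");

        // Pure SQL job: continuous aggregation from Kafka into the JDBC table.
        tEnv.executeSql(
                "INSERT INTO user_totals " +
                "SELECT user_id, SUM(amount) FROM orders GROUP BY user_id");
    }
}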
Could these dependencies lead to conflicts? Only Kafka might, but if our goal is for kafka-universal to support all Kafka versions, we can hope to cover the vast majority of users.

We don't want to put every jar into the fat distribution, only the common ones that cause few conflicts; of course, which jars to include in the fat distribution is a matter of consideration. We have the opportunity to make things easier for the majority of users while still leaving room for customization.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com> wrote:

Hi,

I think we should first reach a consensus on what problem we want to solve:
(1) improve the first experience, or (2) improve the production experience?

As far as I can see from the discussion above, what we want to solve is the "first experience".
And I think the slim distribution is still the best for production, because it's easier to add jars than to exclude them, and it avoids potential class conflicts.

If we want to improve the "first experience", I think it makes sense to have a fat distribution that gives users a smoother start. But I would like to call it a "playground distribution" or something like that, to explicitly distinguish it from the slim, production-purpose distribution.
The "playground distribution" can contain some widely used jars, like the universal-kafka-sql-connector, elasticsearch7-sql-connector, avro, json, csv, etc.
We could even provide a playground Docker image that contains the fat distribution, python3, and Hive.

Best,
Jark


On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ches...@apache.org>
wrote:

I don't see a lot of value in having multiple distributions.

The simple reality is that no fat distribution we could provide would
satisfy all use-cases, so why even try.
If users commonly run into issues for certain jars, then maybe those
should be added to the current distribution.

Personally though I still believe we should only distribute a slim
version. I'd rather have users always add required jars to the
distribution than only when they go outside our "expected" use-cases.
Then we might finally address this issue properly, i.e., tooling to
assemble custom distributions and/or better error messages if
Flink-provided extensions cannot be found.

On 15/04/2020 15:23, Kurt Young wrote:
Regarding the specific solution, I'm not sure about the "fat" and "slim" approach though. I get the idea that we can make the slim one even more lightweight than the current distribution, but what about the "fat" one? Do you mean we would package all connectors and formats into it? I'm not sure that is feasible. For example, we can't put all versions of the Kafka and Hive connector jars into the lib directory, and we might also need Hadoop jars when using the filesystem connector to access data from HDFS.

So my guess is that we would hand-pick some of the most frequently used connectors and formats for our "lib" directory, like the Kafka, CSV, and JSON ones mentioned above, and still leave some other connectors out. If that is the case, then why not just provide this one distribution to users? I'm not sure I see the benefit of providing another super "slim" distribution (we have to pay some cost to maintain another distribution variant).

What do you think?

Best,
Kurt


On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <jingsongl...@gmail.com>
wrote:

Big +1.

I like "fat" and "slim".

For CSV and JSON, as Jark said, they are quite small and don't have other dependencies. They are important for the Kafka connector, and important for the upcoming filesystem connector too.
So can we put them into both "fat" and "slim"? They're that important, and they're that lightweight.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfre...@gmail.com>
wrote:

Big +1.
This will improve the user experience (especially for new Flink users).
We have answered so many questions about "class not found".

Best,
Godfrey

On Wed, 15 Apr 2020 at 16:30, Dian Fu <dian0511...@gmail.com> wrote:

+1 to this proposal.

Missing connector jars are also a big problem for PyFlink users. Currently, after a Python user has installed PyFlink using `pip`, they have to manually copy the connector fat jars into the PyFlink installation directory for the connectors to be usable when running jobs locally. This process is very confusing for users and hurts the experience a lot.

Regards,
Dian

On Wed, 15 Apr 2020 at 15:51, Jark Wu <imj...@gmail.com> wrote:

+1 to the proposal. I also found the "download additional jar" step really cumbersome when preparing webinars.

At least, I think flink-csv and flink-json should be in the distribution; they are quite small and don't have other dependencies.

Best,
Jark

On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com> wrote:

Hi Aljoscha,

Big +1 for the fat Flink distribution. Where do you plan to put these connectors, opt or lib?

On Wed, 15 Apr 2020 at 15:30, Aljoscha Krettek <aljos...@apache.org> wrote:

Hi Everyone,

I'd like to discuss releasing a more full-featured Flink distribution. The motivation is that there is friction for SQL/Table API users who want to use Table connectors that are not included in the current Flink distribution. For these users the workflow is currently roughly:

   - download Flink dist
   - configure csv/Kafka/json connectors per configuration
   - run SQL client or program
   - decrypt error message and research the solution
   - download additional connector jars
   - program works correctly

I realize that this can be made to work, but if every SQL user has this as their first experience, that doesn't seem good to me.

My proposal is to provide two versions of the Flink distribution in the future: "fat" and "slim" (names to be discussed):

   - slim would be even trimmer than today's distribution
   - fat would contain a lot of convenience connectors (yet to be determined which ones)

And yes, I realize that there are already more dimensions of Flink releases (Scala version and Java version).

For background, our current Flink dist has these in the opt
directory:
   - flink-azure-fs-hadoop-1.10.0.jar
   - flink-cep-scala_2.12-1.10.0.jar
   - flink-cep_2.12-1.10.0.jar
   - flink-gelly-scala_2.12-1.10.0.jar
   - flink-gelly_2.12-1.10.0.jar
   - flink-metrics-datadog-1.10.0.jar
   - flink-metrics-graphite-1.10.0.jar
   - flink-metrics-influxdb-1.10.0.jar
   - flink-metrics-prometheus-1.10.0.jar
   - flink-metrics-slf4j-1.10.0.jar
   - flink-metrics-statsd-1.10.0.jar
   - flink-oss-fs-hadoop-1.10.0.jar
   - flink-python_2.12-1.10.0.jar
   - flink-queryable-state-runtime_2.12-1.10.0.jar
   - flink-s3-fs-hadoop-1.10.0.jar
   - flink-s3-fs-presto-1.10.0.jar
   - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
   - flink-sql-client_2.12-1.10.0.jar
   - flink-state-processor-api_2.12-1.10.0.jar
   - flink-swift-fs-hadoop-1.10.0.jar

The current Flink dist is 267 MB. If we removed everything from opt we would go down to 126 MB. I would recommend this, because the large majority of the files in opt are probably unused.

What do you think?

Best,
Aljoscha


--
Best Regards

Jeff Zhang



--
Best, Jingsong Lee

--
Best, Jingsong Lee