Hi all,

A few points from my side:
1. I like the idea of simplifying the experience for first-time users. As for production use cases, I share Jark's opinion that there I would expect users to assemble their distribution manually. In such scenarios it is important to understand the interconnections. Personally I'd expect the slimmest possible distribution, which I can extend with whatever I need in my production scenario.

2. I think there is also the problem that the matrix of potentially useful combinations is already big. Do we want to have a distribution for:
   - SQL users: which connectors should we include? Should we include Hive? Which other catalogs?
   - DataStream users: which connectors should we include?
   - For both of the above, should we include YARN/Kubernetes?
   I would opt for providing only the "slim" distribution as a release artifact.

3. However, as I said, I think it's worth investigating how we can improve the user experience. What do you think of providing a tool, e.g. a shell script, that constructs a distribution based on the user's choices? I think that is also what Chesnay referred to as "tooling to assemble custom distributions". In the end, the difference between a slim and a fat distribution is which jars we put into lib, right? The tool could have a few "screens":
   1. Which API are you interested in?
      a. SQL API
      b. DataStream API
   2. [SQL] Which connectors do you want to use? [multichoice]
      a. Kafka
      b. Elasticsearch
      ...
   3. [SQL] Which catalog do you want to use?
      ...
   Such a tool would download all the dependencies from Maven and put them into the correct folders. In the future we could extend it with additional rules, e.g. kafka-0.9 cannot be chosen at the same time as kafka-universal, etc. The benefit would be that the distribution we release could remain "slim", or we could even make it slimmer. I might be missing something here though.
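To make the idea concrete, here is a minimal sketch of what such an assembly script could look like. The Maven Central URL layout is the standard one; the artifact names in the example are illustrative, not an official list, and `FLINK_HOME` is assumed to point at an unpacked slim distribution.

```shell
#!/bin/sh
# Sketch of a "assemble your own distribution" helper (hypothetical, not an
# existing Flink tool). It maps a Maven coordinate to a Maven Central URL and
# downloads the jar into the distribution's lib directory.

maven_central_url() {
  # $1: group id (dot-separated), $2: artifact id, $3: version
  group_path=$(printf '%s' "$1" | tr '.' '/')
  printf 'https://repo1.maven.org/maven2/%s/%s/%s/%s-%s.jar\n' \
    "$group_path" "$2" "$3" "$2" "$3"
}

fetch_into_lib() {
  # Download one chosen connector/format jar into $FLINK_HOME/lib.
  url=$(maven_central_url "$1" "$2" "$3")
  curl -fsSL -o "${FLINK_HOME:-.}/lib/$2-$3.jar" "$url"
}

# Example choice from an imagined "[SQL] which connectors?" screen:
# fetch_into_lib org.apache.flink flink-json 1.10.0
```

The interactive "screens" would then just be a loop over such `fetch_into_lib` calls, and the compatibility rules (e.g. kafka-0.9 vs. kafka-universal) a check over the selected set before downloading.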
Best,
Dawid

On 16/04/2020 11:02, Aljoscha Krettek wrote:
> I want to reinforce my opinion from earlier: this is about improving the situation both for first-time users and for experienced users that want to use a Flink dist in production. The current Flink dist is too "thin" for first-time SQL users and too "fat" for production users, so we are serving no-one properly with the current middle ground. That's why I think introducing those specialized "spins" of Flink dist would be good.
>
> By the way, at some point in the future production users might not even need to get a Flink dist anymore. They should be able to have Flink as a dependency of their project (including the runtime) and then build an image from this for Kubernetes or a fat jar for YARN.
>
> Aljoscha
>
> On 15.04.20 18:14, wenlong.lwl wrote:
>> Hi all,
>>
>> Regarding slim and fat distributions, I think different kinds of jobs may prefer different types of distribution:
>>
>> For DataStream jobs, I think we may not like a fat distribution containing connectors, because users would always need to depend on the connector in user code anyway, and it is easy to include the connector jar in the user lib. Fewer jars in lib means fewer class conflicts and problems.
>>
>> For SQL jobs, I think we are trying to encourage users to use pure SQL (DDL + DML) to construct their jobs. In order to improve the user experience, it may be important for Flink not only to provide as many connector jars in the distribution as possible, especially the connectors and formats we have documented well, but also to provide a mechanism to load connectors according to the DDLs.
>>
>> So I think it could be good to place connector/format jars in some dir like opt/connector, which would not affect jobs by default, and introduce a mechanism of dynamic discovery for SQL.
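As an editorial aside: wenlong's dynamic-discovery idea could be approximated even without runtime support, by scanning the DDL script for the connectors it references and activating only those jars. This is not an existing Flink mechanism; the `'connector.type'` property follows the Flink 1.10 DDL syntax, and the `flink-sql-connector-<type>*.jar` naming in `opt/connector` is an assumed convention.

```shell
#!/bin/sh
# Hypothetical sketch: copy only the connector jars that a SQL script
# actually references from opt/connector into lib.

activate_connectors() {
  # $1: SQL script, $2: opt/connector directory, $3: lib directory
  grep -o "'connector\.type' *= *'[a-z0-9_-]*'" "$1" |
    sed "s/.*= *'\(.*\)'/\1/" |
    sort -u |
    while read -r type; do
      # Activate every jar matching the referenced connector type.
      for jar in "$2"/flink-sql-connector-"$type"*.jar; do
        [ -e "$jar" ] && cp "$jar" "$3"/
      done
    done
}
```

A real implementation inside the SQL client could instead resolve connector factories from such a directory at planning time, which avoids mutating `lib` at all.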
>>
>> Best,
>> Wenlong
>>
>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsongl...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am thinking about both "improve first experience" and "improve production experience".
>>>
>>> I'm thinking about what the common mode of Flink is. Streaming jobs use Kafka? Batch jobs use Hive?
>>>
>>> Hive 1.2.1 dependencies can be compatible with most Hive server versions, so Spark and Presto have a built-in Hive 1.2.1 dependency. Flink is currently mainly used for streaming, so let's not talk about Hive.
>>>
>>> For streaming jobs, the jobs in my mind are (related to connectors):
>>> - ETL jobs: Kafka -> Kafka
>>> - Join jobs: Kafka -> DimJDBC -> Kafka
>>> - Aggregation jobs: Kafka -> JDBCSink
>>> So Kafka and JDBC are probably the most commonly used, along with the CSV and JSON formats.
>>> So we could provide a fat distribution:
>>> - with CSV and JSON;
>>> - with flink-kafka-universal and Kafka dependencies;
>>> - with flink-jdbc.
>>> Using this fat distribution, most users can run their jobs well. (A JDBC driver jar is still required, but that is very natural to do.)
>>> Can these dependencies lead to conflicts? Only Kafka may have conflicts, but if our goal is to use kafka-universal to support all Kafka versions, it should cover the vast majority of users.
>>>
>>> We don't want to put all jars into the fat distribution, only the common ones with few conflicts. Of course, which jars to put into the fat distribution is a matter of consideration. We have the opportunity to help the majority of users while also leaving room for customization.
>>>
>>> Best,
>>> Jingsong Lee
>>>
>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I think we should first reach a consensus on "what problem do we want to solve?"
>>>> (1) improve first experience? or (2) improve production experience?
>>>>
>>>> As far as I can see from the above discussion, what we want to solve is the "first experience". And I think the slim jar is still the best distribution for production, because assembling jars is easier than excluding jars and avoids potential class conflicts.
>>>>
>>>> If we want to improve the "first experience", I think it makes sense to have a fat distribution to give users a smoother first experience. But I would like to call it a "playground distribution" or something like that, to explicitly distinguish it from the "slim production-purpose distribution". The "playground distribution" can contain some widely used jars, like universal-kafka-sql-connector, elasticsearch7-sql-connector, avro, json, csv, etc. We could even provide a playground docker image which contains the fat distribution, python3, and Hive.
>>>>
>>>> Best,
>>>> Jark
>>>>
>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ches...@apache.org> wrote:
>>>>
>>>>> I don't see a lot of value in having multiple distributions.
>>>>>
>>>>> The simple reality is that no fat distribution we could provide would satisfy all use-cases, so why even try. If users commonly run into issues for certain jars, then maybe those should be added to the current distribution.
>>>>>
>>>>> Personally though, I still believe we should only distribute a slim version. I'd rather have users always add required jars to the distribution than only when they go outside our "expected" use-cases. Then we might finally address this issue properly, i.e., tooling to assemble custom distributions and/or better error messages if Flink-provided extensions cannot be found.
>>>>>
>>>>> On 15/04/2020 15:23, Kurt Young wrote:
>>>>>> Regarding the specific solution, I'm not sure about the "fat" and "slim" solution though.
>>>>>> I get the idea that we can make the slim one even more lightweight than the current distribution, but what about the "fat" one? Do you mean that we would package all connectors and formats into it? I'm not sure that is feasible. For example, we can't put all versions of the Kafka and Hive connector jars into the lib directory, and we also might need Hadoop jars when using the filesystem connector to access data from HDFS.
>>>>>>
>>>>>> So my guess would be that we hand-pick some of the most frequently used connectors and formats for our "lib" directory, like Kafka, CSV and JSON mentioned above, and still leave other connectors out. If that is the case, then why not just provide this one distribution to users? I don't see the benefit of providing another super "slim" distribution (we would have to pay some cost to maintain another suite of distributions).
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Best,
>>>>>> Kurt
>>>>>>
>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <jingsongl...@gmail.com> wrote:
>>>>>>
>>>>>>> Big +1.
>>>>>>>
>>>>>>> I like "fat" and "slim".
>>>>>>>
>>>>>>> For CSV and JSON, like Jark said, they are quite small and don't have other dependencies. They are important to the Kafka connector, and important to the upcoming file system connector too. So can we put them into both "fat" and "slim"? They're so important, and they're so lightweight.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jingsong Lee
>>>>>>>
>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfre...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Big +1.
>>>>>>>> This will improve the user experience (especially for new Flink users). We have answered so many questions about "class not found".
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Godfrey
>>>>>>>>
>>>>>>>> On Wed, 15 Apr 2020 at 16:30, Dian Fu <dian0511...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> +1 to this proposal.
>>>>>>>>>
>>>>>>>>> Missing connector jars is also a big problem for PyFlink users. Currently, after a Python user has installed PyFlink using `pip`, they have to manually copy the connector fat jars into the PyFlink installation directory for the connectors to be usable when running jobs locally. This process is very confusing for users and hurts the experience a lot.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Dian
>>>>>>>>>
>>>>>>>>>> On 15 Apr 2020, at 15:51, Jark Wu <imj...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> +1 to the proposal. I also found the "download additional jar" step really tedious when preparing webinars.
>>>>>>>>>>
>>>>>>>>>> At the least, I think flink-csv and flink-json should be in the distribution; they are quite small and don't have other dependencies.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Jark
>>>>>>>>>>
>>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Aljoscha,
>>>>>>>>>>>
>>>>>>>>>>> Big +1 for the fat Flink distribution. Where do you plan to put these connectors, opt or lib?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 15 Apr 2020 at 15:30, Aljoscha Krettek <aljos...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I'd like to discuss releasing a more full-featured Flink distribution. The motivation is that there is friction for SQL/Table API users that want to use Table connectors which are not in the current Flink distribution.
>>>>>>>>>>>> For these users the workflow is currently roughly:
>>>>>>>>>>>>
>>>>>>>>>>>> - download Flink dist
>>>>>>>>>>>> - configure csv/Kafka/json connectors per configuration
>>>>>>>>>>>> - run SQL client or program
>>>>>>>>>>>> - decrypt error message and research the solution
>>>>>>>>>>>> - download additional connector jars
>>>>>>>>>>>> - program works correctly
>>>>>>>>>>>>
>>>>>>>>>>>> I realize that this can be made to work, but if every SQL user has this as their first experience, that doesn't seem good to me.
>>>>>>>>>>>>
>>>>>>>>>>>> My proposal is to provide two versions of the Flink distribution in the future: "fat" and "slim" (names to be discussed):
>>>>>>>>>>>>
>>>>>>>>>>>> - slim would be even trimmer than today's distribution
>>>>>>>>>>>> - fat would contain a lot of convenience connectors (yet to be determined which ones)
>>>>>>>>>>>>
>>>>>>>>>>>> And yes, I realize that there are already more dimensions of Flink releases (Scala version and Java version).
>>>>>>>>>>>>
>>>>>>>>>>>> For background, our current Flink dist has these in the opt directory:
>>>>>>>>>>>> - flink-azure-fs-hadoop-1.10.0.jar
>>>>>>>>>>>> - flink-cep-scala_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-cep_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-gelly-scala_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-gelly_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-metrics-datadog-1.10.0.jar
>>>>>>>>>>>> - flink-metrics-graphite-1.10.0.jar
>>>>>>>>>>>> - flink-metrics-influxdb-1.10.0.jar
>>>>>>>>>>>> - flink-metrics-prometheus-1.10.0.jar
>>>>>>>>>>>> - flink-metrics-slf4j-1.10.0.jar
>>>>>>>>>>>> - flink-metrics-statsd-1.10.0.jar
>>>>>>>>>>>> - flink-oss-fs-hadoop-1.10.0.jar
>>>>>>>>>>>> - flink-python_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-queryable-state-runtime_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-s3-fs-hadoop-1.10.0.jar
>>>>>>>>>>>> - flink-s3-fs-presto-1.10.0.jar
>>>>>>>>>>>> - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>>>>>>> - flink-sql-client_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-state-processor-api_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-swift-fs-hadoop-1.10.0.jar
>>>>>>>>>>>>
>>>>>>>>>>>> The current Flink dist is 267M. If we removed everything from opt, we would go down to 126M. I would recommend this, because the large majority of the files in opt are probably unused.
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Aljoscha
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best Regards
>>>>>>>>>>>
>>>>>>>>>>> Jeff Zhang
>>>>>>>
>>>>>>> --
>>>>>>> Best, Jingsong Lee
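Editorial note on Dian's PyFlink point above: the manual step of locating the `pip`-installed PyFlink directory and copying connector jars into it can at least be scripted. The snippet below assumes the jars belong in the `lib` folder under the installed `pyflink` package (PyFlink 1.10's local-execution layout) and that the connector jar has already been downloaded.

```shell
#!/bin/sh
# Print the lib directory of the pip-installed PyFlink package, by asking the
# Python interpreter where the pyflink module lives. PYTHON can override the
# interpreter (defaults to python3).

pyflink_lib_dir() {
  "${PYTHON:-python3}" -c "import os, pyflink; print(os.path.join(os.path.dirname(os.path.abspath(pyflink.__file__)), 'lib'))"
}

# Usage (jar name illustrative):
# cp flink-sql-connector-kafka_2.11-1.10.0.jar "$(pyflink_lib_dir)"
```

A fat "playground" distribution, or `pip`-installable connector packages, would make this workaround unnecessary.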