Hi all,

A few points from my side:
1. I like the idea of simplifying the experience for first-time users. As for production use cases, I share Jark's opinion that there I would expect users to assemble their distribution manually. In such scenarios it is important to understand the interconnections. Personally I'd expect the slimmest possible distribution, which I can extend with whatever I need in my production scenario.

2. I think there is also the problem that the matrix of potentially useful combinations is already big. Do we want to have a distribution for:
   - SQL users: which connectors should we include? Should we include Hive? Which other catalogs?
   - DataStream users: which connectors should we include?
   - For both of the above, should we include YARN/Kubernetes?
   I would opt for providing only the "slim" distribution as a release artifact.

3. However, as I said, I think it's worth investigating how we can improve the user experience. What do you think of providing a tool, e.g. a shell script, that constructs a distribution based on the user's choices? I think that is also what Chesnay referred to as "tooling to assemble custom distributions". In the end, the difference between a slim and a fat distribution is which jars we put into lib, right? The tool could have a few "screens":
   1. Which API are you interested in?
      a. SQL API
      b. DataStream API
   2. [SQL] Which connectors do you want to use? [multichoice]
      a. Kafka
      b. Elasticsearch
      ...
   3. [SQL] Which catalog do you want to use?
      ...
   Such a tool would download all the dependencies from Maven and put them into the correct folders. In the future we could extend it with additional rules, e.g. kafka-0.9 cannot be chosen at the same time as kafka-universal, etc. The benefit would be that the distribution we release could remain "slim", or we could even make it slimmer. I might be missing something here though.
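To make the idea concrete, here is a minimal sketch of what such an assembly script could look like. The Maven Central URL layout is the standard one; the artifact names in the example are illustrative, not an official list, and `FLINK_HOME` is assumed to point at an unpacked slim distribution.

```shell
#!/bin/sh
# Sketch of a "assemble your own distribution" helper (hypothetical, not an
# existing Flink tool). It maps a Maven coordinate to a Maven Central URL and
# downloads the jar into the distribution's lib directory.

maven_central_url() {
  # $1: group id (dot-separated), $2: artifact id, $3: version
  group_path=$(printf '%s' "$1" | tr '.' '/')
  printf 'https://repo1.maven.org/maven2/%s/%s/%s/%s-%s.jar\n' \
    "$group_path" "$2" "$3" "$2" "$3"
}

fetch_into_lib() {
  # Download one chosen connector/format jar into $FLINK_HOME/lib.
  url=$(maven_central_url "$1" "$2" "$3")
  curl -fsSL -o "${FLINK_HOME:-.}/lib/$2-$3.jar" "$url"
}

# Example choice from an imagined "[SQL] which connectors?" screen:
# fetch_into_lib org.apache.flink flink-json 1.10.0
```

The interactive "screens" would then just be a loop over such `fetch_into_lib` calls, and the compatibility rules (e.g. kafka-0.9 vs. kafka-universal) a check over the selected set before downloading.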
Best,
Dawid

On 16/04/2020 11:02, Aljoscha Krettek wrote:
> I want to reinforce my opinion from earlier: this is about improving the situation both for first-time users and for experienced users that want to use a Flink dist in production. The current Flink dist is too "thin" for first-time SQL users and too "fat" for production users, so we are serving no-one properly with the current middle ground. That's why I think introducing those specialized "spins" of Flink dist would be good.
>
> By the way, at some point in the future production users might not even need to get a Flink dist anymore. They should be able to have Flink as a dependency of their project (including the runtime) and then build an image from this for Kubernetes or a fat jar for YARN.
>
> Aljoscha
>
> On 15.04.20 18:14, wenlong.lwl wrote:
>> Hi all,
>>
>> Regarding slim and fat distributions, I think different kinds of jobs may prefer different types of distribution:
>>
>> For DataStream jobs, I think we may not like a fat distribution containing connectors, because users would always need to depend on the connector in user code anyway, and it is easy to include the connector jar in the user lib. Fewer jars in lib means fewer class conflicts and problems.
>>
>> For SQL jobs, I think we are trying to encourage users to use pure SQL (DDL + DML) to construct their jobs. In order to improve the user experience, it may be important for Flink not only to provide as many connector jars in the distribution as possible, especially the connectors and formats we have documented well, but also to provide a mechanism to load connectors according to the DDLs.
>>
>> So I think it could be good to place connector/format jars in some dir like opt/connector, which would not affect jobs by default, and introduce a mechanism of dynamic discovery for SQL.
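As an editorial aside: wenlong's dynamic-discovery idea could be approximated even without runtime support, by scanning the DDL script for the connectors it references and activating only those jars. This is not an existing Flink mechanism; the `'connector.type'` property follows the Flink 1.10 DDL syntax, and the `flink-sql-connector-<type>*.jar` naming in `opt/connector` is an assumed convention.

```shell
#!/bin/sh
# Hypothetical sketch: copy only the connector jars that a SQL script
# actually references from opt/connector into lib.

activate_connectors() {
  # $1: SQL script, $2: opt/connector directory, $3: lib directory
  grep -o "'connector\.type' *= *'[a-z0-9_-]*'" "$1" |
    sed "s/.*= *'\(.*\)'/\1/" |
    sort -u |
    while read -r type; do
      # Activate every jar matching the referenced connector type.
      for jar in "$2"/flink-sql-connector-"$type"*.jar; do
        [ -e "$jar" ] && cp "$jar" "$3"/
      done
    done
}
```

A real implementation inside the SQL client could instead resolve connector factories from such a directory at planning time, which avoids mutating `lib` at all.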
>>
>> Best,
>> Wenlong
>>
>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <jingsongl...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I am thinking about both "improve first experience" and "improve production experience".
>>>
>>> I'm thinking about what the common mode of Flink is. Streaming jobs use Kafka? Batch jobs use Hive?
>>>
>>> Hive 1.2.1 dependencies can be compatible with most Hive server versions, so Spark and Presto have a built-in Hive 1.2.1 dependency. Flink is currently mainly used for streaming, so let's not talk about Hive.
>>>
>>> For streaming jobs, the jobs in my mind are (related to connectors):
>>> - ETL jobs: Kafka -> Kafka
>>> - Join jobs: Kafka -> DimJDBC -> Kafka
>>> - Aggregation jobs: Kafka -> JDBCSink
>>> So Kafka and JDBC are probably the most commonly used, along with the CSV and JSON formats.
>>> So we could provide a fat distribution:
>>> - with CSV and JSON;
>>> - with flink-kafka-universal and Kafka dependencies;
>>> - with flink-jdbc.
>>> Using this fat distribution, most users can run their jobs well. (A JDBC driver jar is still required, but that is very natural to do.)
>>> Can these dependencies lead to conflicts? Only Kafka may have conflicts, but if our goal is to use kafka-universal to support all Kafka versions, it should cover the vast majority of users.
>>>
>>> We don't want to put all jars into the fat distribution, only the common ones with few conflicts. Of course, which jars to put into the fat distribution is a matter of consideration. We have the opportunity to help the majority of users while also leaving room for customization.
>>>
>>> Best,
>>> Jingsong Lee
>>>
>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I think we should first reach a consensus on "what problem do we want to solve?"
>>>> (1) improve first experience? or (2) improve production experience?
>>>>
>>>> As far as I can see from the above discussion, what we want to solve is the "first experience". And I think the slim jar is still the best distribution for production, because assembling jars is easier than excluding jars and avoids potential class conflicts.
>>>>
>>>> If we want to improve the "first experience", I think it makes sense to have a fat distribution to give users a smoother first experience. But I would like to call it a "playground distribution" or something like that, to explicitly distinguish it from the "slim production-purpose distribution". The "playground distribution" can contain some widely used jars, like universal-kafka-sql-connector, elasticsearch7-sql-connector, avro, json, csv, etc. We could even provide a playground docker image which contains the fat distribution, python3, and Hive.
>>>>
>>>> Best,
>>>> Jark
>>>>
>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <ches...@apache.org> wrote:
>>>>
>>>>> I don't see a lot of value in having multiple distributions.
>>>>>
>>>>> The simple reality is that no fat distribution we could provide would satisfy all use-cases, so why even try. If users commonly run into issues for certain jars, then maybe those should be added to the current distribution.
>>>>>
>>>>> Personally though, I still believe we should only distribute a slim version. I'd rather have users always add required jars to the distribution than only when they go outside our "expected" use-cases. Then we might finally address this issue properly, i.e., tooling to assemble custom distributions and/or better error messages if Flink-provided extensions cannot be found.
>>>>>
>>>>> On 15/04/2020 15:23, Kurt Young wrote:
>>>>>> Regarding the specific solution, I'm not sure about the "fat" and "slim" solution though.
>>>>>> I get the idea that we can make the slim one even more lightweight than the current distribution, but what about the "fat" one? Do you mean that we would package all connectors and formats into it? I'm not sure that is feasible. For example, we can't put all versions of the Kafka and Hive connector jars into the lib directory, and we also might need Hadoop jars when using the filesystem connector to access data from HDFS.
>>>>>>
>>>>>> So my guess would be that we hand-pick some of the most frequently used connectors and formats for our "lib" directory, like Kafka, CSV and JSON mentioned above, and still leave other connectors out. If that is the case, then why not just provide this one distribution to users? I don't see the benefit of providing another super "slim" distribution (we would have to pay some cost to maintain another suite of distributions).
>>>>>>
>>>>>> What do you think?
>>>>>>
>>>>>> Best,
>>>>>> Kurt
>>>>>>
>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <jingsongl...@gmail.com> wrote:
>>>>>>
>>>>>>> Big +1.
>>>>>>>
>>>>>>> I like "fat" and "slim".
>>>>>>>
>>>>>>> For CSV and JSON, like Jark said, they are quite small and don't have other dependencies. They are important to the Kafka connector, and important to the upcoming file system connector too. So can we put them into both "fat" and "slim"? They're so important, and they're so lightweight.
>>>>>>>
>>>>>>> Best,
>>>>>>> Jingsong Lee
>>>>>>>
>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <godfre...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Big +1.
>>>>>>>> This will improve the user experience (especially for new Flink users). We have answered so many questions about "class not found".
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Godfrey
>>>>>>>>
>>>>>>>> On Wed, 15 Apr 2020 at 16:30, Dian Fu <dian0511...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> +1 to this proposal.
>>>>>>>>>
>>>>>>>>> Missing connector jars is also a big problem for PyFlink users. Currently, after a Python user has installed PyFlink using `pip`, they have to manually copy the connector fat jars into the PyFlink installation directory for the connectors to be usable when running jobs locally. This process is very confusing for users and hurts the experience a lot.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Dian
>>>>>>>>>
>>>>>>>>>> On 15 Apr 2020, at 15:51, Jark Wu <imj...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> +1 to the proposal. I also found the "download additional jar" step really tedious when preparing webinars.
>>>>>>>>>>
>>>>>>>>>> At the least, I think flink-csv and flink-json should be in the distribution; they are quite small and don't have other dependencies.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Jark
>>>>>>>>>>
>>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Aljoscha,
>>>>>>>>>>>
>>>>>>>>>>> Big +1 for the fat Flink distribution. Where do you plan to put these connectors, opt or lib?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 15 Apr 2020 at 15:30, Aljoscha Krettek <aljos...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>>
>>>>>>>>>>>> I'd like to discuss releasing a more full-featured Flink distribution. The motivation is that there is friction for SQL/Table API users that want to use Table connectors which are not in the current Flink distribution.
>>>>>>>>>>>> For these users the workflow is currently roughly:
>>>>>>>>>>>>
>>>>>>>>>>>> - download Flink dist
>>>>>>>>>>>> - configure csv/Kafka/json connectors per configuration
>>>>>>>>>>>> - run SQL client or program
>>>>>>>>>>>> - decrypt error message and research the solution
>>>>>>>>>>>> - download additional connector jars
>>>>>>>>>>>> - program works correctly
>>>>>>>>>>>>
>>>>>>>>>>>> I realize that this can be made to work, but if every SQL user has this as their first experience, that doesn't seem good to me.
>>>>>>>>>>>>
>>>>>>>>>>>> My proposal is to provide two versions of the Flink distribution in the future: "fat" and "slim" (names to be discussed):
>>>>>>>>>>>>
>>>>>>>>>>>> - slim would be even trimmer than today's distribution
>>>>>>>>>>>> - fat would contain a lot of convenience connectors (yet to be determined which ones)
>>>>>>>>>>>>
>>>>>>>>>>>> And yes, I realize that there are already more dimensions of Flink releases (Scala version and Java version).
>>>>>>>>>>>>
>>>>>>>>>>>> For background, our current Flink dist has these in the opt directory:
>>>>>>>>>>>> - flink-azure-fs-hadoop-1.10.0.jar
>>>>>>>>>>>> - flink-cep-scala_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-cep_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-gelly-scala_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-gelly_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-metrics-datadog-1.10.0.jar
>>>>>>>>>>>> - flink-metrics-graphite-1.10.0.jar
>>>>>>>>>>>> - flink-metrics-influxdb-1.10.0.jar
>>>>>>>>>>>> - flink-metrics-prometheus-1.10.0.jar
>>>>>>>>>>>> - flink-metrics-slf4j-1.10.0.jar
>>>>>>>>>>>> - flink-metrics-statsd-1.10.0.jar
>>>>>>>>>>>> - flink-oss-fs-hadoop-1.10.0.jar
>>>>>>>>>>>> - flink-python_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-queryable-state-runtime_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-s3-fs-hadoop-1.10.0.jar
>>>>>>>>>>>> - flink-s3-fs-presto-1.10.0.jar
>>>>>>>>>>>> - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>>>>>>> - flink-sql-client_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-state-processor-api_2.12-1.10.0.jar
>>>>>>>>>>>> - flink-swift-fs-hadoop-1.10.0.jar
>>>>>>>>>>>>
>>>>>>>>>>>> The current Flink dist is 267M. If we removed everything from opt, we would go down to 126M. I would recommend this, because the large majority of the files in opt are probably unused.
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think?
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Aljoscha
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best Regards
>>>>>>>>>>>
>>>>>>>>>>> Jeff Zhang
>>>>>>>
>>>>>>> --
>>>>>>> Best, Jingsong Lee
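Editorial note on Dian's PyFlink point above: the manual step of locating the `pip`-installed PyFlink directory and copying connector jars into it can at least be scripted. The snippet below assumes the jars belong in the `lib` folder under the installed `pyflink` package (PyFlink 1.10's local-execution layout) and that the connector jar has already been downloaded.

```shell
#!/bin/sh
# Print the lib directory of the pip-installed PyFlink package, by asking the
# Python interpreter where the pyflink module lives. PYTHON can override the
# interpreter (defaults to python3).

pyflink_lib_dir() {
  "${PYTHON:-python3}" -c "import os, pyflink; print(os.path.join(os.path.dirname(os.path.abspath(pyflink.__file__)), 'lib'))"
}

# Usage (jar name illustrative):
# cp flink-sql-connector-kafka_2.11-1.10.0.jar "$(pyflink_lib_dir)"
```

A fat "playground" distribution, or `pip`-installable connector packages, would make this workaround unnecessary.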