Are you suggesting that we add the SQL dependencies to opt/ or lib/? I
thought the argument against opt/ was that it would not be much different
from downloading the additional dependencies manually.
Moving it to lib/ would, in my opinion, justify a separate release because
of potential dependency conflicts for users who don't want to use SQL.

Cheers,
Till

On Tue, May 5, 2020 at 10:01 AM Aljoscha Krettek <aljos...@apache.org> wrote:

> Thanks Till for summarizing!
>
> Another alternative is to stick to one distribution but remove one of
> the very heavy filesystem connectors and add all the mentioned SQL
> connectors/formats, which will keep the size of the distribution the
> same, or make it a bit smaller.
>
> Best,
> Aljoscha
>
> On 04.05.20 18:59, Till Rohrmann wrote:
> > Thanks everyone for this lively discussion and all your thoughts.
> >
> > Let me try to summarise the current state of the discussion and then
> > let's see how we can move it forward.
> >
> > To begin with, I think everyone agrees that we want to improve Flink's
> > user experience. In particular, we want to improve the experience of
> > first-time users who want to try out Flink's SQL functionality.
> >
> > The problem which stands in the way of a good user experience is that
> > the current Flink distribution contains too few dependencies for a
> > smooth first-time SQL experience and too many dependencies for a lean
> > production setup. Hence, Aljoscha proposed to create a "fat" and a
> > "slim" Flink distribution addressing these two differing needs.
> >
> > As far as the discussion goes, there are two remaining discussion
> > points.
> >
> > 1. How do we serve the different types of distributions?
> >
> > a) Create a "fat" and a "slim" distribution which are served from the
> > Flink web site.
> > b) Create a "slim" distribution which is served from the Flink web
> > site and have a tool (e.g. a script) which can turn a slim
> > distribution into a fat distribution by downloading the additional
> > dependencies.
> >
> > In favour of a) is that it is simpler and does not require the user
> > to execute an additional step. The downside is that we will add
> > another dimension to the release matrix, which will complicate the
> > release process (see Chesnay's last comment for more details).
> >
> > In favour of b) is that it is potentially the more general solution,
> > as we can provide different options for different distributions
> > (e.g. choosing a connector version, required filesystems, metric
> > reporters, etc.). The downside is the additional step for the user
> > and that we need such a tool (which in itself could be quite complex).
> >
> > 2. What is contained in the "fat" distribution?
> >
> > The current proposal is to move everything which can be moved from
> > opt to the plugins directory (metric reporters and filesystems). That
> > way the user will be able to use all of these implementations without
> > running into dependency conflicts.
> >
> > For the SQL support, Aljoscha proposed to add:
> >
> > flink-avro-1.10.0.jar
> > flink-csv-1.10.0.jar
> > flink-hbase_2.11-1.10.0.jar
> > flink-jdbc_2.11-1.10.0.jar
> > flink-json-1.10.0.jar
> > flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> > flink-sql-connector-kafka_2.11-1.10.0.jar
> >
> > How to move forward from here?
> >
> > Given that the time until the feature freeze is limited, I would
> > actually propose to follow the simplest approach, which is the
> > creation of two distributions ("fat" & "slim"). We can still rethink
> > this decision at a later point and introduce a tool which allows
> > downloading a custom-built Flink distribution. At that point we could
> > then remove the "fat" distribution from the web site. Of course, this
> > comes at the cost of increased release complexity, but I believe that
> > the user experience will make up for it.
> >
> > As for what to include, I think we could take Aljoscha's proposal and
> > then see what other dependencies the most common SQL use cases
> > require. I guess that the SQL guys know quite precisely where users
> > run into problems.
> >
> > I know that this solution might not be perfect (in particular wrt
> > releases) but I hope that everyone could live with it for the time
> > being.
> >
> > Feel free to add anything I might have forgotten to mention here.
> >
> > Cheers,
> > Till
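For illustration, option b) could start out as small as the sketch below.
This is a rough sketch only, not a committed design: the jar selection
mirrors Aljoscha's proposal above, while the opt/ target directory and the
use of curl against the Maven Central URL layout are assumptions.

    #!/usr/bin/env bash
    # Sketch of a "slim -> fat" helper: fetch the proposed SQL
    # connectors/formats from Maven Central into an existing Flink dist.
    # The jar list and the opt/ target are illustrative assumptions.
    set -euo pipefail

    FLINK_HOME="${1:?usage: $0 /path/to/flink-dist}"
    V="1.10.0"
    REPO="https://repo1.maven.org/maven2/org/apache/flink"

    ARTIFACTS=(
      flink-avro flink-csv flink-json
      flink-hbase_2.11 flink-jdbc_2.11
      flink-sql-connector-elasticsearch6_2.11
      flink-sql-connector-kafka_2.11
    )

    for a in "${ARTIFACTS[@]}"; do
      echo "Fetching ${a}-${V}.jar"
      curl -sSfL "${REPO}/${a}/${V}/${a}-${V}.jar" \
        -o "${FLINK_HOME}/opt/${a}-${V}.jar"
    done

Most of the complexity Till mentions would come on top of this: choosing
the Scala suffix, verifying checksums and signatures, and handling mirrors.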
> >
> > On Tue, Apr 28, 2020 at 11:43 AM Chesnay Schepler <ches...@apache.org>
> > wrote:
> >
> >> It would be good if we could nail down what a slim/fat distribution
> >> would look like, as there are various ideas floating around in this
> >> thread.
> >>
> >> Like, what is a "slim" distribution? Are we just emptying /opt?
> >> Removing everything larger than 1mb? Are we throwing out the Table
> >> API from /lib for a minimal streaming distribution? Are we going ham
> >> and removing the YARN integration from the flink-dist jar?
> >>
> >> While I can see how a fat distribution can certainly help with the
> >> out-of-the-box experience, I'm not so sold on the slim variant.
> >> If someone is capable of assembling a distribution matching their
> >> use-case, do they even need a slim distribution in the first place?
> >>
> >> I really want us to stick to 1 distribution type, as I'm worried
> >> about the implications of 2 or, FWIW, any number of additional
> >> distribution types:
> >>
> >> - you need separate assemblies, including a new profile
> >> - adjusting opt/plugins and making sure the examples match the
> >>   bundled contents (e.g., no gelly/python, maybe some SQL examples
> >>   if there are any that use a connector)
> >> - another 300mb uploaded to dist.apache.org + whatever the fat
> >>   distribution grows by, x3 (scala 2.11/2.12 + python)
> >>     - the latter naturally being susceptible to additional growth
> >>       in the future
> >>     - this is also a pain for release managers since SVN likes to
> >>       throw up if the upload is too large + it increases upload time
> >> - another 2 distributions to test during a release
> >> - another distribution type we need to test via CI
> >> - more content downloaded into the docker images by default
> >>     - unless of course we release separate slim/fat images (where we
> >>       would then circle back to the above 2 points, just
> >>       docker-flavored)
> >> - any further addition to the release matrix implies an additional 4
> >>   distributions => long-term ramifications
> >>     - e.g., another scala version
> >>
> >> On 24/04/2020 15:15, Kurt Young wrote:
> >>> +1 for the "slim" and "fat" solution. One comment about the fat one:
> >>> I think we need to put all needed jars into /lib (or /plugins).
> >>> Putting jars into /opt and relying on users moving them from /opt
> >>> to /lib doesn't really improve the out-of-the-box experience.
> >>>
> >>> Best,
> >>> Kurt
> >>>
> >>> On Fri, Apr 24, 2020 at 8:28 PM Aljoscha Krettek
> >>> <aljos...@apache.org> wrote:
> >>>
> >>>> re (1): I don't know about that; probably the people that did the
> >>>> metrics reporter plugin support had some thoughts about that.
> >>>>
> >>>> re (2): I agree, that's why I initially suggested to split it into
> >>>> "slim" and "fat": our current "medium fat" selection of jars in
> >>>> Flink dist does not serve anyone too well. It's too fat for people
> >>>> that want to build lean application images. It's too lean for
> >>>> people that want a good first out-of-box experience.
> >>>>
> >>>> Aljoscha
> >>>>
> >>>> On 17.04.20 16:38, Stephan Ewen wrote:
> >>>>> @Aljoscha I think that is an interesting line of thinking. The
> >>>>> swift-fs may be rarely enough used to move it to an optional
> >>>>> download.
> >>>>>
> >>>>> I would still drop two more thoughts:
> >>>>>
> >>>>> (1) Now that we have plugins support, is there a reason to have a
> >>>>> metrics reporter or file system in /opt instead of /plugins? They
> >>>>> don't spoil the class path any more.
> >>>>>
> >>>>> (2) I can imagine there still being a desire to have a "minimal"
> >>>>> docker file, for users that want to keep the container images as
> >>>>> small as possible, to speed up deployment. It is fine if that
> >>>>> would not be the default, though.
> >>>>>
> >>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek
> >>>>> <aljos...@apache.org> wrote:
> >>>>>
> >>>>>> I think having such tools and/or tailor-made distributions can
> >>>>>> be nice, but I also think the discussion is missing the main
> >>>>>> point: the initial observation/motivation is that apparently a
> >>>>>> lot of users (Kurt and I talked about this) on the Chinese
> >>>>>> DingTalk support groups and other support channels have problems
> >>>>>> when first using the SQL client because of these missing
> >>>>>> connectors/formats. For these users, having additional tools
> >>>>>> would not solve anything because they would also not take that
> >>>>>> extra step. I think that even tiny friction should be avoided
> >>>>>> because the annoyance from it accumulates across the (hopefully)
> >>>>>> many users that we want to have.
> >>>>>>
> >>>>>> Maybe we should take a step back from discussing the
> >>>>>> "fat"/"slim" idea and instead think about the composition of the
> >>>>>> current dist. As mentioned, we have these jars in opt/:
> >>>>>>
> >>>>>> 17M  flink-azure-fs-hadoop-1.10.0.jar
> >>>>>> 52K  flink-cep-scala_2.11-1.10.0.jar
> >>>>>> 180K flink-cep_2.11-1.10.0.jar
> >>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
> >>>>>> 626K flink-gelly_2.11-1.10.0.jar
> >>>>>> 512K flink-metrics-datadog-1.10.0.jar
> >>>>>> 159K flink-metrics-graphite-1.10.0.jar
> >>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
> >>>>>> 102K flink-metrics-prometheus-1.10.0.jar
> >>>>>> 10K  flink-metrics-slf4j-1.10.0.jar
> >>>>>> 12K  flink-metrics-statsd-1.10.0.jar
> >>>>>> 36M  flink-oss-fs-hadoop-1.10.0.jar
> >>>>>> 28M  flink-python_2.11-1.10.0.jar
> >>>>>> 22K  flink-queryable-state-runtime_2.11-1.10.0.jar
> >>>>>> 18M  flink-s3-fs-hadoop-1.10.0.jar
> >>>>>> 31M  flink-s3-fs-presto-1.10.0.jar
> >>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >>>>>> 518K flink-sql-client_2.11-1.10.0.jar
> >>>>>> 99K  flink-state-processor-api_2.11-1.10.0.jar
> >>>>>> 25M  flink-swift-fs-hadoop-1.10.0.jar
> >>>>>> 160M opt
> >>>>>>
> >>>>>> The "filesystem" connectors are the heavy hitters there.
> >>>>>>
> >>>>>> I downloaded most of the SQL connectors/formats and this is what
> >>>>>> I got:
> >>>>>>
> >>>>>> 73K  flink-avro-1.10.0.jar
> >>>>>> 36K  flink-csv-1.10.0.jar
> >>>>>> 55K  flink-hbase_2.11-1.10.0.jar
> >>>>>> 88K  flink-jdbc_2.11-1.10.0.jar
> >>>>>> 42K  flink-json-1.10.0.jar
> >>>>>> 20M  flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> >>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> >>>>>> 24M  sql-connectors-formats
> >>>>>>
> >>>>>> We could just add these to the Flink distribution without
> >>>>>> blowing it up by much. We could drop any of the existing
> >>>>>> "filesystem" connectors from opt, add the SQL connectors/formats,
> >>>>>> and not change the size of Flink dist. So maybe we should do
> >>>>>> that instead?
> >>>>>>
> >>>>>> We would need some tooling for the sql-client shell script to
> >>>>>> pick the connectors/formats up from opt/ because we don't want
> >>>>>> to add them to lib/. We're already doing that for finding the
> >>>>>> flink-sql-client jar, which is also not in lib/.
> >>>>>>
> >>>>>> What do you think?
> >>>>>>
> >>>>>> Best,
> >>>>>> Aljoscha
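For illustration, that opt/ pickup could look roughly like the sketch
below in the sql-client launcher. This is not the actual sql-client.sh
code; the glob patterns and variable names are assumptions.

    # Collect SQL connectors/formats from opt/ into an extra classpath,
    # without ever copying them into lib/. FLINK_HOME is assumed to be set.
    CONNECTOR_CLASSPATH=""
    for jar in "$FLINK_HOME"/opt/flink-sql-connector-*.jar \
               "$FLINK_HOME"/opt/flink-{avro,csv,json}-*.jar; do
      [ -e "$jar" ] && CONNECTOR_CLASSPATH="$CONNECTOR_CLASSPATH:$jar"
    done
    # The launcher would then append $CONNECTOR_CLASSPATH to the
    # classpath it already builds for the flink-sql-client jar.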
> >>>>>>
> >>>>>> On 17.04.20 05:22, Jark Wu wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> I like the idea of a web tool to assemble a fat distribution,
> >>>>>>> and https://code.quarkus.io/ looks very nice.
> >>>>>>> All users need to do is select what they need (I think this
> >>>>>>> step can't be omitted anyway).
> >>>>>>> We can also provide a default fat distribution on the web which
> >>>>>>> pre-selects some popular connectors.
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Jark
> >>>>>>>
> >>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <rafi.ar...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> As a reference for a nice first experience I had, take a look
> >>>>>>>> at https://code.quarkus.io/
> >>>>>>>> You reach this page after you click "Start Coding" at the
> >>>>>>>> project homepage.
> >>>>>>>>
> >>>>>>>> Rafi
> >>>>>>>>
> >>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <ykt...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> I'm not saying pre-bundling some jars will make this problem
> >>>>>>>>> go away, and you're right that it only hides the problem for
> >>>>>>>>> some users. But what if this solution can hide the problem
> >>>>>>>>> for 90% of users? Wouldn't that be good enough for us to try?
> >>>>>>>>>
> >>>>>>>>> Regarding "would users following instructions really be such
> >>>>>>>>> a big problem?": I'm afraid yes. Otherwise I wouldn't have
> >>>>>>>>> answered such questions at least a dozen times, and I
> >>>>>>>>> wouldn't see such questions coming up from time to time.
> >>>>>>>>> During some periods, I even saw such questions every day.
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Kurt
> >>>>>>>>>
> >>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler
> >>>>>>>>> <ches...@apache.org> wrote:
> >>>>>>>>>
> >>>>>>>>>> The problem with having a distribution with "popular" stuff
> >>>>>>>>>> is that it doesn't really *solve* a problem, it just hides
> >>>>>>>>>> it for users who fall into these particular use-cases.
> >>>>>>>>>> Move out of them and you once again run into the exact same
> >>>>>>>>>> problems outlined. This is exactly why I like the tooling
> >>>>>>>>>> approach; you have to deal with it from the start, and
> >>>>>>>>>> transitioning to a custom use-case is easier.
> >>>>>>>>>>
> >>>>>>>>>> Would users following instructions really be such a big
> >>>>>>>>>> problem? I would expect that users generally know *what*
> >>>>>>>>>> they need, just not necessarily how it is assembled
> >>>>>>>>>> correctly (where to get which jar, which directory to put
> >>>>>>>>>> it in).
> >>>>>>>>>> It seems like these are exactly the problems this would
> >>>>>>>>>> solve? I just don't see how moving a jar corresponding to
> >>>>>>>>>> some feature from opt to some directory (lib/plugins) is
> >>>>>>>>>> less error-prone than just selecting the feature and having
> >>>>>>>>>> the tool handle the rest.
> >>>>>>>>>>
> >>>>>>>>>> As for re-distributions, it depends on the form that the
> >>>>>>>>>> tool would take. It could be an application that runs
> >>>>>>>>>> locally and works against Maven Central (note: not
> >>>>>>>>>> necessarily *using* Maven); this should work in China, no?
> >>>>>>>>>>
> >>>>>>>>>> A web tool would of course be fancy, but I don't know how
> >>>>>>>>>> feasible this is with the ASF infrastructure.
> >>>>>>>>>> You wouldn't be able to mirror the distribution, so the load
> >>>>>>>>>> can't be distributed. I doubt INFRA would like this.
> >>>>>>>>>>
> >>>>>>>>>> Note that third parties could also start distributing
> >>>>>>>>>> use-case-oriented distributions, which would be perfectly
> >>>>>>>>>> fine as far as I'm concerned.
> >>>>>>>>>>
> >>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
> >>>>>>>>>>
> >>>>>>>>>> I'm not so sure about the web tool solution though. The
> >>>>>>>>>> concern I have with this approach is that the final
> >>>>>>>>>> generated distribution is kind of non-deterministic. We
> >>>>>>>>>> might generate too many different combinations when users
> >>>>>>>>>> try to package different types of connectors, formats, and
> >>>>>>>>>> maybe even Hadoop releases. As far as I can tell, most open
> >>>>>>>>>> source projects and Apache projects only release some
> >>>>>>>>>> pre-defined distributions, which most users are already
> >>>>>>>>>> familiar with, and that is thus hard to change IMO. I have
> >>>>>>>>>> also seen cases where users re-distribute the release
> >>>>>>>>>> package because of the unstable network to the Apache
> >>>>>>>>>> website from China. With the web tool solution, I don't
> >>>>>>>>>> think this kind of re-distribution would be possible anymore.
> >>>>>>>>>>
> >>>>>>>>>> In the meantime, I also have a concern that we will fall
> >>>>>>>>>> into our own trap again if we try to offer this smart &
> >>>>>>>>>> flexible solution, because it needs users to cooperate with
> >>>>>>>>>> such a mechanism. It's exactly the situation we currently
> >>>>>>>>>> fell into:
> >>>>>>>>>> 1. We offered a smart solution.
> >>>>>>>>>> 2. We hope users will follow the correct instructions.
> >>>>>>>>>> 3. Everything will work as expected if users followed the
> >>>>>>>>>> right instructions.
> >>>>>>>>>>
> >>>>>>>>>> In reality, I suspect not all users will do the second step
> >>>>>>>>>> correctly. And for new users who are only trying to have a
> >>>>>>>>>> quick experience with Flink, I would bet most users will do
> >>>>>>>>>> it wrong.
> >>>>>>>>>>
> >>>>>>>>>> So, my proposal would be one of the following 2 options:
> >>>>>>>>>> 1. Provide a slim distribution for advanced production users
> >>>>>>>>>> and provide a distribution which has some popular built-in
> >>>>>>>>>> jars.
> >>>>>>>>>> 2. Only provide a distribution which has some popular
> >>>>>>>>>> built-in jars.
> >>>>>>>>>> If we are trying to reduce the distributions we release, I
> >>>>>>>>>> would prefer 2 over 1.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Kurt
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann
> >>>>>>>>>> <trohrm...@apache.org> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I think what Chesnay and Dawid proposed would be the ideal
> >>>>>>>>>> solution. Ideally, we would also have a nice web tool for
> >>>>>>>>>> the website which generates the corresponding distribution
> >>>>>>>>>> for download.
> >>>>>>>>>>
> >>>>>>>>>> To get things started, we could begin with only supporting
> >>>>>>>>>> downloading/creating the "fat" version via the script. The
> >>>>>>>>>> fat version would then consist of the slim distribution and
> >>>>>>>>>> whatever we deem important for new users to get started.
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>> Till
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz
> >>>>>>>>>> <dwysakow...@apache.org> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi all,
> >>>>>>>>>>
> >>>>>>>>>> Few points from my side:
> >>>>>>>>>>
> >>>>>>>>>> 1. I like the idea of simplifying the experience for
> >>>>>>>>>> first-time users. As for production use cases, I share
> >>>>>>>>>> Jark's opinion that there I would expect users to assemble
> >>>>>>>>>> their distribution manually. I think in such scenarios it is
> >>>>>>>>>> important to understand the interconnections. Personally,
> >>>>>>>>>> I'd expect the slimmest possible distribution that I can
> >>>>>>>>>> extend further with what I need in my production scenario.
> >>>>>>>>>>
> >>>>>>>>>> 2. I think there is also the problem that the matrix of
> >>>>>>>>>> possible combinations that can be useful is already big. Do
> >>>>>>>>>> we want to have a distribution for:
> >>>>>>>>>>
> >>>>>>>>>>    SQL users: which connectors should we include? Should we
> >>>>>>>>>>    include Hive? Which other catalog?
> >>>>>>>>>>
> >>>>>>>>>>    DataStream users: which connectors should we include?
> >>>>>>>>>>
> >>>>>>>>>>    For both of the above, should we include yarn/kubernetes?
> >>>>>>>>>>
> >>>>>>>>>> I would opt for providing only the "slim" distribution as a
> >>>>>>>>>> release artifact.
> >>>>>>>>>>
> >>>>>>>>>> 3. However, as I said, I think it's worth investigating how
> >>>>>>>>>> we can improve the user experience. What do you think of
> >>>>>>>>>> providing a tool, e.g. a shell script, that constructs a
> >>>>>>>>>> distribution based on the user's choice? I think that is
> >>>>>>>>>> also what Chesnay mentioned as "tooling to assemble custom
> >>>>>>>>>> distributions". In the end, the difference between a slim
> >>>>>>>>>> and a fat distribution is which jars we put into lib, right?
> >>>>>>>>>> It could have a few "screens".
> >>>>>>>>>>
> >>>>>>>>>> 1. Which API are you interested in:
> >>>>>>>>>>    a. SQL API
> >>>>>>>>>>    b. DataStream API
> >>>>>>>>>>
> >>>>>>>>>> 2. [SQL] Which connectors do you want to use? [multichoice]:
> >>>>>>>>>>    a. Kafka
> >>>>>>>>>>    b. Elasticsearch
> >>>>>>>>>>    ...
> >>>>>>>>>>
> >>>>>>>>>> 3. [SQL] Which catalog do you want to use?
> >>>>>>>>>>
> >>>>>>>>>> ...
> >>>>>>>>>>
> >>>>>>>>>> Such a tool would download all the dependencies from Maven
> >>>>>>>>>> and put them into the correct folder. In the future we can
> >>>>>>>>>> extend it with additional rules, e.g. kafka-0.9 cannot be
> >>>>>>>>>> chosen at the same time as kafka-universal etc.
> >>>>>>>>>>
> >>>>>>>>>> The benefit of it would be that the distribution that we
> >>>>>>>>>> release could remain "slim", or we could even make it
> >>>>>>>>>> slimmer. I might be missing something here though.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Dawid
> >>>>>>>>>>
> >>>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
> >>>>>>>>>>
> >>>>>>>>>> I want to reinforce my opinion from earlier: this is about
> >>>>>>>>>> improving the situation both for first-time users and for
> >>>>>>>>>> experienced users that want to use a Flink dist in
> >>>>>>>>>> production. The current Flink dist is too "thin" for
> >>>>>>>>>> first-time SQL users and it is too "fat" for production
> >>>>>>>>>> users; we are serving no-one properly with the current
> >>>>>>>>>> middle ground. That's why I think introducing those
> >>>>>>>>>> specialized "spins" of Flink dist would be good.
> >>>>>>>>>>
> >>>>>>>>>> By the way, at some point in the future production users
> >>>>>>>>>> might not even need to get a Flink dist anymore. They should
> >>>>>>>>>> be able to have Flink as a dependency of their project
> >>>>>>>>>> (including the runtime) and then build an image from this
> >>>>>>>>>> for Kubernetes or a fat jar for YARN.
> >>>>>>>>>>
> >>>>>>>>>> Aljoscha
> >>>>>>>>>>
> >>>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi all,
> >>>>>>>>>>
> >>>>>>>>>> Regarding slim and fat distributions, I think different
> >>>>>>>>>> kinds of jobs may prefer different types of distribution:
> >>>>>>>>>>
> >>>>>>>>>> For DataStream jobs, I think we may not like a fat
> >>>>>>>>>> distribution containing connectors, because users would
> >>>>>>>>>> always need to depend on the connector in user code anyway,
> >>>>>>>>>> and it is easy to include the connector jar in the user lib.
> >>>>>>>>>> Fewer jars in lib means fewer class conflicts and problems.
> >>>>>>>>>>
> >>>>>>>>>> For SQL jobs, I think we are trying to encourage users to
> >>>>>>>>>> use pure SQL (DDL + DML) to construct their jobs. In order
> >>>>>>>>>> to improve the user experience, it may be important for
> >>>>>>>>>> Flink not only to provide as many connector jars in the
> >>>>>>>>>> distribution as possible, especially the connectors and
> >>>>>>>>>> formats we have well documented, but also to provide a
> >>>>>>>>>> mechanism to load connectors according to the DDLs.
> >>>>>>>>>>
> >>>>>>>>>> So I think it could be good to place connector/format jars
> >>>>>>>>>> in some dir like opt/connector, which would not affect jobs
> >>>>>>>>>> by default, and introduce a mechanism of dynamic discovery
> >>>>>>>>>> for SQL.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Wenlong
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li
> >>>>>>>>>> <jingsongl...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I am thinking about both "improve the first experience" and
> >>>>>>>>>> "improve the production experience".
> >>>>>>>>>>
> >>>>>>>>>> I'm thinking about what the common modes of Flink are:
> >>>>>>>>>> streaming jobs using Kafka? Batch jobs using Hive?
> >>>>>>>>>>
> >>>>>>>>>> Hive 1.2.1 dependencies can be compatible with most Hive
> >>>>>>>>>> server versions, so Spark and Presto have a built-in Hive
> >>>>>>>>>> 1.2.1 dependency. Flink is currently mainly used for
> >>>>>>>>>> streaming, so let's not talk about Hive.
> >>>>>>>>>>
> >>>>>>>>>> For streaming jobs, the jobs in my mind are (related to
> >>>>>>>>>> connectors):
> >>>>>>>>>> - ETL jobs: Kafka -> Kafka
> >>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
> >>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
> >>>>>>>>>> So Kafka and JDBC are probably the most commonly used. Of
> >>>>>>>>>> course, this also includes the CSV and JSON formats.
> >>>>>>>>>> So we could provide a fat distribution:
> >>>>>>>>>> - with CSV and JSON;
> >>>>>>>>>> - with flink-kafka-universal and its Kafka dependencies;
> >>>>>>>>>> - with flink-jdbc.
> >>>>>>>>>> Using this fat distribution, most users can run their jobs
> >>>>>>>>>> well (a JDBC driver jar is required, but this is very
> >>>>>>>>>> natural to do).
> >>>>>>>>>> Can these dependencies lead to conflicts? Only Kafka may
> >>>>>>>>>> have conflicts, but if our goal is to use kafka-universal to
> >>>>>>>>>> support all Kafka versions, it can hopefully cover the vast
> >>>>>>>>>> majority of users.
> >>>>>>>>>>
> >>>>>>>>>> We don't want to put all jars into the fat distribution,
> >>>>>>>>>> only the less conflict-prone and common ones. Of course,
> >>>>>>>>>> which jars to put into the fat distribution is a matter of
> >>>>>>>>>> consideration.
> >>>>>>>>>> We have the opportunity to help the majority of users while
> >>>>>>>>>> also leaving room for customization.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jingsong Lee
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <imj...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> I think we should first reach a consensus on "what problem
> >>>>>>>>>> do we want to solve?"
> >>>>>>>>>> (1) improve the first experience? or (2) improve the
> >>>>>>>>>> production experience?
> >>>>>>>>>>
> >>>>>>>>>> As far as I can see from the above discussion, I think what
> >>>>>>>>>> we want to solve is the "first experience".
> >>>>>>>>>> And I think the slim jar is still the best distribution for
> >>>>>>>>>> production, because it's easier to assemble jars than to
> >>>>>>>>>> exclude jars, and it can avoid potential class conflicts.
> >>>>>>>>>>
> >>>>>>>>>> If we want to improve the "first experience", I think it
> >>>>>>>>>> makes sense to have a fat distribution to give users a
> >>>>>>>>>> smoother first experience.
> >>>>>>>>>> But I would like to call it a "playground distribution" or
> >>>>>>>>>> something like that, to explicitly differ from the "slim
> >>>>>>>>>> production-purpose distribution".
> >>>>>>>>>>
> >>>>>>>>>> The "playground distribution" can contain some widely used
> >>>>>>>>>> jars, like the universal Kafka SQL connector, the
> >>>>>>>>>> elasticsearch7 SQL connector, avro, json, csv, etc.
> >>>>>>>>>> We could even provide a playground docker image which
> >>>>>>>>>> contains the fat distribution, python3, and Hive.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jark
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler
> >>>>>>>>>> <ches...@apache.org> wrote:
> >>>>>>>>>>
> >>>>>>>>>> I don't see a lot of value in having multiple distributions.
> >>>>>>>>>>
> >>>>>>>>>> The simple reality is that no fat distribution we could
> >>>>>>>>>> provide would satisfy all use-cases, so why even try?
> >>>>>>>>>> If users commonly run into issues for certain jars, then
> >>>>>>>>>> maybe those should be added to the current distribution.
> >>>>>>>>>>
> >>>>>>>>>> Personally though, I still believe we should only distribute
> >>>>>>>>>> a slim version. I'd rather have users always add required
> >>>>>>>>>> jars to the distribution than only when they go outside our
> >>>>>>>>>> "expected" use-cases.
> >>>>>>>>>>
> >>>>>>>>>> Then we might finally address this issue properly, i.e.,
> >>>>>>>>>> with tooling to assemble custom distributions and/or better
> >>>>>>>>>> error messages if Flink-provided extensions cannot be found.
> >>>>>>>>>>
> >>>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
> >>>>>>>>>>
> >>>>>>>>>> Regarding the specific solution, I'm not sure about the
> >>>>>>>>>> "fat" and "slim" solution though. I get the idea that we can
> >>>>>>>>>> make the slim one even more lightweight than the current
> >>>>>>>>>> distribution, but what about the "fat" one? Do you mean that
> >>>>>>>>>> we would package all connectors and formats into it? I'm not
> >>>>>>>>>> sure that is feasible. For example, we can't put all
> >>>>>>>>>> versions of the Kafka and Hive connector jars into the lib
> >>>>>>>>>> directory, and we also might need Hadoop jars when using the
> >>>>>>>>>> filesystem connector to access data from HDFS.
> >>>>>>>>>>
> >>>>>>>>>> So my guess would be that we might hand-pick some of the
> >>>>>>>>>> most frequently used connectors and formats for our "lib"
> >>>>>>>>>> directory, like kafka, csv, and json mentioned above, and
> >>>>>>>>>> still leave some other connectors out of it.
> >>>>>>>>>> If this is the case, then why don't we just provide this
> >>>>>>>>>> distribution to users? I'm not sure I get the benefit of
> >>>>>>>>>> providing another super "slim" jar (we would have to pay
> >>>>>>>>>> some costs to provide another suite of distributions).
> >>>>>>>>>>
> >>>>>>>>>> What do you think?
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Kurt
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li
> >>>>>>>>>> <jingsongl...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Big +1.
> >>>>>>>>>>
> >>>>>>>>>> I like "fat" and "slim".
> >>>>>>>>>>
> >>>>>>>>>> For csv and json, like Jark said, they are quite small and
> >>>>>>>>>> don't have other dependencies. They are important to the
> >>>>>>>>>> Kafka connector, and important to the upcoming filesystem
> >>>>>>>>>> connector too.
> >>>>>>>>>> So can we include them in both "fat" and "slim"? They're so
> >>>>>>>>>> important, and they're so lightweight.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jingsong Lee
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he
> >>>>>>>>>> <godfre...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Big +1.
> >>>>>>>>>> This will improve the user experience (especially for new
> >>>>>>>>>> Flink users). We have answered so many questions about
> >>>>>>>>>> "class not found".
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Godfrey
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Apr 15, 2020 at 4:30 PM, Dian Fu
> >>>>>>>>>> <dian0511...@gmail.com> wrote:
> >>>>>>>>>>
> >>>>>>>>>> +1 to this proposal.
> >>>>>>>>>>
> >>>>>>>>>> Missing connector jars is also a big problem for PyFlink
> >>>>>>>>>> users. Currently, after a Python user has installed PyFlink
> >>>>>>>>>> using `pip`, he has to manually copy the connector fat jars
> >>>>>>>>>> to the PyFlink installation directory for the connectors to
> >>>>>>>>>> be used if he wants to run jobs locally. This process is
> >>>>>>>>>> very confusing for users and affects the experience a lot.
> >>>>>>>>>>
> >>>>>>>>>> Regards,
> >>>>>>>>>> Dian
> >>>>>>>>>>
> >>>>>>>>>> On Apr 15, 2020, at 3:51 PM, Jark Wu <imj...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> +1 to the proposal. I also found the "download additional
> >>>>>>>>>> jars" step really verbose when I prepare webinars.
> >>>>>>>>>>
> >>>>>>>>>> At least, I think flink-csv and flink-json should be in the
> >>>>>>>>>> distribution; they are quite small and don't have other
> >>>>>>>>>> dependencies.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Jark
> >>>>>>>>>>
> >>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <zjf...@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Aljoscha,
> >>>>>>>>>>
> >>>>>>>>>> Big +1 for the fat Flink distribution. Where do you plan to
> >>>>>>>>>> put these connectors, opt or lib?
> >>>>>>>>>>
> >>>>>>>>>> On Wed, Apr 15, 2020 at 3:30 PM, Aljoscha Krettek
> >>>>>>>>>> <aljos...@apache.org> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Hi Everyone,
> >>>>>>>>>>
> >>>>>>>>>> I'd like to discuss releasing a more full-featured Flink
> >>>>>>>>>> distribution. The motivation is that there is friction for
> >>>>>>>>>> SQL/Table API users that want to use Table connectors which
> >>>>>>>>>> are not in the current Flink distribution. For these users
> >>>>>>>>>> the workflow is currently roughly:
> >>>>>>>>>>
> >>>>>>>>>> - download Flink dist
> >>>>>>>>>> - configure csv/Kafka/json connectors per configuration
> >>>>>>>>>> - run SQL client or program
> >>>>>>>>>> - decrypt the error message and research the solution
> >>>>>>>>>> - download additional connector jars
> >>>>>>>>>> - program works correctly
> >>>>>>>>>>
> >>>>>>>>>> I realize that this can be made to work, but if every SQL
> >>>>>>>>>> user has this as their first experience, that doesn't seem
> >>>>>>>>>> good to me.
> >>>>>>>>>>
> >>>>>>>>>> My proposal is to provide two versions of the Flink
> >>>>>>>>>> distribution in the future: "fat" and "slim" (names to be
> >>>>>>>>>> discussed):
> >>>>>>>>>>
> >>>>>>>>>> - slim would be even trimmer than today's distribution
> >>>>>>>>>> - fat would contain a lot of convenience connectors (yet to
> >>>>>>>>>> be determined which ones)
> >>>>>>>>>>
> >>>>>>>>>> And yes, I realize that there are already more dimensions of
> >>>>>>>>>> Flink releases (Scala version and Java version).
> >>>>>>>>>>
> >>>>>>>>>> For background, our current Flink dist has these in the opt
> >>>>>>>>>> directory:
> >>>>>>>>>>
> >>>>>>>>>> - flink-azure-fs-hadoop-1.10.0.jar
> >>>>>>>>>> - flink-cep-scala_2.12-1.10.0.jar
> >>>>>>>>>> - flink-cep_2.12-1.10.0.jar
> >>>>>>>>>> - flink-gelly-scala_2.12-1.10.0.jar
> >>>>>>>>>> - flink-gelly_2.12-1.10.0.jar
> >>>>>>>>>> - flink-metrics-datadog-1.10.0.jar
> >>>>>>>>>> - flink-metrics-graphite-1.10.0.jar
> >>>>>>>>>> - flink-metrics-influxdb-1.10.0.jar
> >>>>>>>>>> - flink-metrics-prometheus-1.10.0.jar
> >>>>>>>>>> - flink-metrics-slf4j-1.10.0.jar
> >>>>>>>>>> - flink-metrics-statsd-1.10.0.jar
> >>>>>>>>>> - flink-oss-fs-hadoop-1.10.0.jar
> >>>>>>>>>> - flink-python_2.12-1.10.0.jar
> >>>>>>>>>> - flink-queryable-state-runtime_2.12-1.10.0.jar
> >>>>>>>>>> - flink-s3-fs-hadoop-1.10.0.jar
> >>>>>>>>>> - flink-s3-fs-presto-1.10.0.jar
> >>>>>>>>>> - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >>>>>>>>>> - flink-sql-client_2.12-1.10.0.jar
> >>>>>>>>>> - flink-state-processor-api_2.12-1.10.0.jar
> >>>>>>>>>> - flink-swift-fs-hadoop-1.10.0.jar
> >>>>>>>>>>
> >>>>>>>>>> Current Flink dist is 267M. If we removed everything from
> >>>>>>>>>> opt we would go down to 126M. I would recommend this,
> >>>>>>>>>> because the large majority of the files in opt are probably
> >>>>>>>>>> unused.
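For context, the "download additional connector jars" step in the
workflow above typically boils down to manual downloads into lib/, along
these lines (version and artifact choice are illustrative):

    cd flink-1.10.0
    # fetch a format and a connector jar from Maven Central into lib/
    wget -P lib \
      https://repo1.maven.org/maven2/org/apache/flink/flink-json/1.10.0/flink-json-1.10.0.jar \
      https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka_2.11/1.10.0/flink-sql-connector-kafka_2.11-1.10.0.jar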
> >>>>>>>>>> What do you think?
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Aljoscha
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Best Regards
> >>>>>>>>>>
> >>>>>>>>>> Jeff Zhang
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Best, Jingsong Lee
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>> Best, Jingsong Lee