I have seen Spark users across a wide range of versions, from Spark 2.2 to
Spark 3, so I expect Flink to be similar once we have more Flink users on
board. I also agree with Danny that we should work together to make the Flink
writer production-ready for large data flows as soon as possible.

How about we do the same thing as for Spark and keep the application code in
hudi-flink, hudi-flink-1.11, and hudi-flink-1.12? The table and index should
still stay in hudi-flink-client. If hudi-flink-client needs to include any
version-specific Flink feature, we can discuss the details in the RFC. WDYT?
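
A minimal sketch of how such a per-version layout might be resolved, purely
for illustration. Everything in it is hypothetical: the class
FlinkModuleLayoutSketch, the method moduleFor, and the hyphenated module name
hudi-flink-1.12 are assumptions rather than existing Hudi code, and falling
back to hudi-flink for every other Flink version is only one possible policy.

```java
// Hypothetical sketch: none of these names exist in Hudi or Flink.
class FlinkModuleLayoutSketch {

    // Maps a Flink version string such as "1.11.2" to the module that would
    // hold the application code for that version line.
    static String moduleFor(String flinkVersion) {
        String[] parts = flinkVersion.trim().split("\\.");
        if (parts.length < 2) {
            throw new IllegalArgumentException(
                "Expected major.minor[.patch]: " + flinkVersion);
        }
        String majorMinor = parts[0] + "." + parts[1];
        switch (majorMinor) {
            case "1.12":
                return "hudi-flink-1.12";
            case "1.11":
                return "hudi-flink-1.11";
            default:
                // Assumed policy: every other version uses the base module.
                return "hudi-flink";
        }
    }

    public static void main(String[] args) {
        // The shared table and index code stays in hudi-flink-client in all cases.
        for (String v : new String[] {"1.10.1", "1.11.2", "1.12.0"}) {
            System.out.println(v + " -> " + moduleFor(v) + " (+ hudi-flink-client)");
        }
    }
}
```

In this sketch only the application module varies with the Flink version,
while hudi-flink-client is always on the classpath.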

Best Regards,
Gary Li
 

On 1/7/21, 8:12 PM, "vino yang" <yanghua1...@gmail.com> wrote:

    +1 on Gary's opinion.

    Yes, the public APIs that come from AbstractHoodieWriteClient should be
    reusable.

    We could try to make the HoodieFlinkWriteClient a common implementation.

    IIUC, there is a mapping like this:

    SparkRDDWriteClient -> HoodieFlinkWriteClient
    HoodieDeltaStreamer -> HoodieFlinkStreamer (it could be multiple?)

    Actually, where Danny and I diverge is whether we need one
    HoodieFlinkStreamer or two.

    We can maintain one or two, although we are both trying to find a good way
    to maintain a single app (entry point).

    Correct me if I am wrong.

    Best,
    Vino


    On Thu, Jan 7, 2021 at 4:31 PM Gary Li <garyli1...@outlook.com> wrote:

    > Hi all,
    >
    > IIUC the current Flink writer is like an app, just like the delta
    > streamer. If we want to build another Flink writer, we can still share the
    > same Flink client, right? Does the Flink client also have to use new
    > features that are only available in Flink 1.12?
    >
    > Thanks,
    > Gary Li
    > ________________________________
    > From: Danny Chan <danny0...@apache.org>
    > Sent: Thursday, January 7, 2021 10:19 AM
    > To: dev@hudi.apache.org <dev@hudi.apache.org>
    > Subject: Re: Re: [DISCUSS] New Flink Writer Proposal
    >
    > Thanks vino yang ~
    >
    > IMO, we should not put too much energy into the current Flink writer; it
    > is not production-ready in the long run. There are many features that still
    > need to be added for Flink write/read (MOR write, COW read, MOR read, the
    > new index), so we should focus on one version first and make it robust.
    >
    > I really hope that we can work together to make the writer
    > production-ready as soon as possible. We face competition from projects
    > like Apache Iceberg and Delta Lake, so from this perspective there is no
    > benefit in staying compatible with the current writer.
    >
    > My idea is that I propose the new infrastructure first, as quickly as
    > possible (the basic pipeline, the test framework), and then we work
    > together on the new version (MOR write, COW read, MOR read, the new
    > index). We had better not be distracted by promoting the old writer.
    >
    > What do you think?
    >
    > On Wed, Jan 6, 2021 at 2:14 PM vino yang <yanghua1...@gmail.com> wrote:
    >
    > > Hi Danny,
    > >
    > > As we discussed in the doc, we should agree on whether we should be
    > > compatible with versions older than Flink 1.11/1.12.
    > >
    > > We all know that there are some bottlenecks in the current plan. You
    > > proposed some improvements, which is great, but they rely heavily on the
    > > newer features provided by Flink. It is a pity that users of older
    > > versions of Flink have no way to benefit from these features.
    > >
    > > What I can tell you is that some users already run the current Flink
    > > write client, or an improved version of it, in a production environment.
    > > SF Technology, for example, uses Flink 1.8.x and 1.10.x.
    > >
    > > Therefore, I personally suggest two options:
    > >
    > > 1) The new design takes into account users of lower versions as much as
    > > possible and maintains one client version;
    > > 2) The new design is based on the features of the new version and evolves
    > > separately from the old version (we also have a plan to optimize the
    > > current implementation), but the public abstraction can be reused. I think
    > > it is not impossible to maintain multiple versions. Flink used to support
    > > 4+ versions (0.8.2, 0.9, 0.10, 0.11, universal connector) of the Kafka
    > > connector, but they share the same code base.
    > >
    > > Any thoughts and opinions are welcome and appreciated.
    > >
    > > Best,
    > > Vino
    > >
    > > On Wed, Jan 6, 2021 at 1:37 PM vino yang <yanghua1...@gmail.com> wrote:
    > >
    > > > Hi Danny,
    > > >
    > > > You should have cwiki edit permission now.
    > > > If you run into any problems, let me know.
    > > >
    > > > Best,
    > > > Vino
    > > >
    > > > On Wed, Jan 6, 2021 at 12:05 PM Danny Chan <danny0...@apache.org> wrote:
    > > >
    > > >> Sorry ~
    > > >>
    > > >> I forgot to say that my Confluence ID is danny0405.
    > > >>
    > > >> It would be nice if any of you could help with this.
    > > >>
    > > >> Best,
    > > >> Danny Chan
    > > >>
    > > >> On Wed, Jan 6, 2021 at 12:00 PM Danny Chan <danny0...@apache.org> wrote:
    > > >>
    > > >> > Hi, can someone give me CWIKI permission so that I can add the
    > > >> > design details there (maybe as a new RFC though ~)?
    > > >> >
    > > >> > On Tue, Jan 5, 2021 at 2:43 PM wangxianghu <wxhj...@126.com> wrote:
    > > >> >
    > > >> >> +1, thanks Danny!
    > > >> >> I believe the new OperatorCoordinator feature in Flink 1.11 will help
    > > >> >> improve the current implementation.
    > > >> >>
    > > >> >> Best,
    > > >> >>
    > > >> >> XianghuWang
    > > >> >>
    > > >> >> At 2021-01-05 14:17:37, "vino yang" <yanghua1...@gmail.com> wrote:
    > > >> >> >Hi,
    > > >> >> >
    > > >> >> >To share more details: the OperatorCoordinator is part of the new
    > > >> >> >Data Source API (Beta) described in the Flink 1.11 release notes[1].
    > > >> >> >
    > > >> >> >Flink 1.11 was released only about half a year ago. The design of
    > > >> >> >RFC-13 began at the end of 2019, and most of the implementation was
    > > >> >> >completed when Flink 1.11 was released.
    > > >> >> >
    > > >> >> >I believe that the production environments of many large companies
    > > >> >> >have not been upgraded that quickly (at our company, we still have
    > > >> >> >some jobs running on Flink releases below 1.9).
    > > >> >> >
    > > >> >> >So, maybe we need to find a mechanism to benefit both new and old
    > > >> >> >users.
    > > >> >> >
    > > >> >> >[1]:
    > > >> >> >https://flink.apache.org/news/2020/07/06/release-1.11.0.html#new-data-source-api-beta
    > > >> >> >
    > > >> >> >Best,
    > > >> >> >Vino
    > > >> >> >
    > > >> >> >On Tue, Jan 5, 2021 at 12:30 PM vino yang <yanghua1...@gmail.com> wrote:
    > > >> >> >
    > > >> >> >> Hi,
    > > >> >> >>
    > > >> >> >> +1, thank you Danny for introducing this new Flink feature
    > > >> >> >> (OperatorCoordinator)[1] from the latest releases.
    > > >> >> >> This feature is very helpful for improving the implementation
    > > >> >> >> mechanism of the Flink write client.
    > > >> >> >>
    > > >> >> >> But this feature is only available from Flink 1.11 onward. Before
    > > >> >> >> that, there was no good way to coordinate upstream and downstream
    > > >> >> >> tasks through the public API provided by Flink.
    > > >> >> >> My one concern is whether we need to take into account users of
    > > >> >> >> earlier versions (older than Flink 1.11).
    > > >> >> >>
    > > >> >> >> [1]: https://issues.apache.org/jira/browse/FLINK-15099
    > > >> >> >>
    > > >> >> >> Best,
    > > >> >> >> Vino
    > > >> >> >>
    > > >> >> >> On Tue, Jan 5, 2021 at 10:40 AM Gary Li <garyli1...@outlook.com> wrote:
    > > >> >> >>
    > > >> >> >>> Hi Danny,
    > > >> >> >>>
    > > >> >> >>> Thanks for the proposal. I'd recommend starting a new RFC. RFC-13
    > > >> >> >>> is done and already includes some of the refactoring work, so we
    > > >> >> >>> should mark it as completed. Looking forward to further discussion
    > > >> >> >>> on the RFC.
    > > >> >> >>>
    > > >> >> >>> Best,
    > > >> >> >>> Gary Li
    > > >> >> >>> ________________________________
    > > >> >> >>> From: Danny Chan <danny0...@apache.org>
    > > >> >> >>> Sent: Tuesday, January 5, 2021 10:22 AM
    > > >> >> >>> To: dev@hudi.apache.org <dev@hudi.apache.org>
    > > >> >> >>> Subject: Re: [DISCUSS] New Flink Writer Proposal
    > > >> >> >>>
    > > >> >> >>> Sure, I can update the RFC-13 cwiki if you agree with that.
    > > >> >> >>>
    > > >> >> >>> On Tue, Jan 5, 2021 at 2:58 AM Vinoth Chandar <vin...@apache.org> wrote:
    > > >> >> >>>
    > > >> >> >>> > Overall +1 on the idea.
    > > >> >> >>> >
    > > >> >> >>> > Danny, could we move this to the apache cwiki if you don't mind?
    > > >> >> >>> > That's what we have been using for other RFC discussions.
    > > >> >> >>> >
    > > >> >> >>> > On Mon, Jan 4, 2021 at 1:22 AM Danny Chan <danny0...@apache.org> wrote:
    > > >> >> >>> >
    > > >> >> >>> > > The RFC-13 Flink writer has some bottlenecks that make it hard to adapt
    > > >> >> >>> > > to production:
    > > >> >> >>> > >
    > > >> >> >>> > > - The InstantGeneratorOperator has parallelism 1, which limits
    > > >> >> >>> > > high-throughput consumption; because all the split inputs drain to a
    > > >> >> >>> > > single thread, network IO comes under pressure too
    > > >> >> >>> > > - The WriteProcessOperator handles inputs by partition, which means that
    > > >> >> >>> > > within each partition write process the BUCKETs are written one by one,
    > > >> >> >>> > > so file IO is too limited to keep up with high-throughput inputs
    > > >> >> >>> > > - It buffers the data by checkpoints, which is too hard to make robust
    > > >> >> >>> > > for production; the checkpoint function is blocking and should not have
    > > >> >> >>> > > IO operations
    > > >> >> >>> > > - The FlinkHoodieIndex is only valid for a per-job scope; it does not
    > > >> >> >>> > > work for existing bootstrap data or for different Flink jobs
    > > >> >> >>> > >
    > > >> >> >>> > > Thus, here I propose a new design for the Flink writer to solve these
    > > >> >> >>> > > problems[1]. Overall, the new design tries to remove the
    > > >> >> >>> > > single-parallelism operators and make the index more powerful and
    > > >> >> >>> > > scalable.
    > > >> >> >>> > >
    > > >> >> >>> > > I plan to solve these bottlenecks incrementally (4 steps); there are
    > > >> >> >>> > > already some local POCs for these proposals.
    > > >> >> >>> > >
    > > >> >> >>> > > I'm looking forward to your feedback. Any suggestions are appreciated ~
    > > >> >> >>> > >
    > > >> >> >>> > > [1]
    > > >> >> >>> > > https://docs.google.com/document/d/1oOcU0VNwtEtZfTRt3v9z4xNQWY-Hy5beu7a1t5B-75I/edit?usp=sharing
