Beam artifact staging currently relies on a shared file system, which has
limitations, for example when running locally with Docker and a local FS. It
sounds like a distributed-cache-based implementation might be a good
(better?) option for artifact staging, even for the Beam Flink runner?

If so, can the implementation you propose be compatible with the Beam
artifact staging service so that it can be plugged into the Beam Flink
runner?

Thanks,
Thomas


On Mon, Oct 21, 2019 at 2:34 AM jincheng sun <sunjincheng...@gmail.com>
wrote:

> Hi Max,
>
> Sorry for the late reply. Regarding the issue you mentioned above, I'm glad
> to share my thoughts:
>
> > For process-based execution we use Flink's cache distribution instead of
> Beam's artifact staging.
>
> In the current design, we use Flink's cache distribution to upload users'
> files from the client to the cluster in both docker mode and process mode.
> That is, in docker mode Flink's cache distribution and Beam's artifact
> staging service work together.
>
>
> > Do we want to implement two different ways of staging artifacts? It seems
> sensible to use the same artifact staging functionality also for the
> process-based execution.
>
> I agree that the implementation would be simpler if we used the same
> artifact staging functionality for the process-based execution as well.
> However, it is not optimal for performance, as it would introduce an
> additional network transfer: in process mode the TaskManager and the Python
> worker share the same environment, so the user files in Flink's distributed
> cache can be accessed by the Python worker directly. We do not need the
> staging service in this case.
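
The process-mode argument can be sketched as a mode-dependent path
resolution. This is a toy Python sketch with illustrative names only, not
actual Flink or Beam code:

```python
# Hypothetical sketch of where a Python worker reads a user file from,
# depending on the execution mode. All names here are illustrative
# stand-ins, not real Flink or Beam APIs.

def resolve_artifact_path(mode: str, cached_path: str) -> str:
    """Return the path the Python worker should read the user file from."""
    if mode == "process":
        # Process mode: the worker shares the TaskManager's environment, so
        # it can read the file shipped via Flink's distributed cache
        # directly; no extra network transfer is needed.
        return cached_path
    if mode == "docker":
        # Docker mode: the worker runs in a container, so the file must
        # first be transferred into it, e.g. via Beam's artifact staging
        # service (the staging directory below is made up for illustration).
        staged_name = cached_path.rsplit("/", 1)[-1]
        return "/opt/staged/" + staged_name
    raise ValueError(f"unknown execution mode: {mode}")

# In process mode the cached path is used as-is:
resolve_artifact_path("process", "/cache/udf_deps.zip")  # -> "/cache/udf_deps.zip"
```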
>
> > Apart from being simpler, this would also allow the process-based
> execution to run in other environments than the Flink TaskManager
> environment.
>
> IMHO, this case is more like docker mode, and we can share or reuse the
> code of Beam's docker mode. Furthermore, in this case the Python worker is
> launched by the operator, so it is always in the same environment as the
> operator.
>
> Thanks again for your feedback; it is valuable for finding the best final
> architecture.
>
> Feel free to correct me if there is anything incorrect.
>
> Best,
> Jincheng
>
> Maximilian Michels <m...@apache.org> wrote on Wed, Oct 16, 2019 at 4:23 PM:
>
> > I'm also late to the party here :) When I saw the first draft, I was
> > thinking how exactly the design doc would tie in with Beam. Thanks for
> > the update.
> >
> > A couple of comments with this regard:
> >
> > > Flink provides a distributed cache mechanism that allows users to
> > upload their files using the "registerCachedFile" method in
> > ExecutionEnvironment/StreamExecutionEnvironment. The Python files users
> > specify through "add_python_file", "set_python_requirements" and
> > "add_python_archive" are also uploaded through this method eventually.
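
The flow described above can be modeled as follows. This is a toy sketch,
assuming the user-facing dependency APIs eventually funnel into the
distributed cache; the class and method names merely mirror the real Flink
APIs and are not the actual implementation:

```python
# Toy model: user-facing dependency APIs register files into a distributed
# cache table that is shipped from the client to the cluster. Illustrative
# only; not the actual Flink implementation.

class ToyEnv:
    def __init__(self):
        # name -> client-side path; entries are distributed to the cluster
        self.cached_files = {}

    def register_cached_file(self, path: str, name: str) -> None:
        # Mirrors ExecutionEnvironment.registerCachedFile(path, name)
        self.cached_files[name] = path

    def add_python_file(self, path: str) -> None:
        # A Python dependency API that uses the cache underneath; the
        # "python_file:" key prefix is an invented naming scheme.
        self.register_cached_file(path, "python_file:" + path.rsplit("/", 1)[-1])

env = ToyEnv()
env.add_python_file("/home/user/my_udf.py")
# env.cached_files now maps "python_file:my_udf.py" to the local path.
```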
> >
> > For process-based execution we use Flink's cache distribution instead of
> > Beam's artifact staging.
> >
> > > Apache Beam Portability Framework already supports artifact staging
> that
> > works out of the box with the Docker environment. We can use the artifact
> > staging service defined in Apache Beam to transfer the dependencies from
> > the operator to Python SDK harness running in the docker container.
> >
> > Do we want to implement two different ways of staging artifacts? It
> > seems sensible to use the same artifact staging functionality also for
> > the process-based execution. Apart from being simpler, this would also
> > allow the process-based execution to run in other environments than the
> > Flink TaskManager environment.
> >
> > Thanks,
> > Max
> >
> > On 15.10.19 11:13, Wei Zhong wrote:
> > > Hi Thomas,
> > >
> > > Thanks a lot for your suggestion!
> > >
> > > As you can see from the "Goals" section, this FLIP focuses on
> > dependency management in process mode. However, the APIs and design
> > proposed in this FLIP also apply to docker mode. So it makes sense to me
> > to also describe how this design is integrated with the artifact staging
> > service of Apache Beam in docker mode. I have updated the design doc and
> > look forward to your feedback.
> > >
> > > Thanks,
> > > Wei
> > >
> > >> On Oct 15, 2019, at 01:54, Thomas Weise <t...@apache.org> wrote:
> > >>
> > >> Sorry for joining the discussion late.
> > >>
> > >> The Beam environment already supports artifact staging; it works out
> > >> of the box with the Docker environment. I think it would be helpful to
> > >> explain in the FLIP how this proposal relates to what Beam offers /
> > >> how it would be integrated.
> > >>
> > >> Thanks,
> > >> Thomas
> > >>
> > >>
> > >> On Mon, Oct 14, 2019 at 8:09 AM Jeff Zhang <zjf...@gmail.com> wrote:
> > >>
> > >>> +1
> > >>>
> > >>> Hequn Cheng <chenghe...@gmail.com> wrote on Mon, Oct 14, 2019 at 10:55 PM:
> > >>>
> > >>>> +1
> > >>>>
> > >>>> Good job, Wei!
> > >>>>
> > >>>> Best, Hequn
> > >>>>
> > >>>> On Mon, Oct 14, 2019 at 2:54 PM Dian Fu <dian0511...@gmail.com>
> > wrote:
> > >>>>
> > >>>>> Hi Wei,
> > >>>>>
> > >>>>> +1 (non-binding). Thanks for driving this.
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Dian
> > >>>>>
> > >>>>>> On Oct 14, 2019, at 1:40 PM, jincheng sun <sunjincheng...@gmail.com> wrote:
> > >>>>>>
> > >>>>>> +1
> > >>>>>>
> > >>>>>> Wei Zhong <weizhong0...@gmail.com> wrote on Sat, Oct 12, 2019 at 8:41 PM:
> > >>>>>>
> > >>>>>>> Hi all,
> > >>>>>>>
> > >>>>>>> I would like to start the vote for FLIP-78[1], which has been
> > >>>>>>> discussed and has reached consensus in the discussion thread[2].
> > >>>>>>>
> > >>>>>>> The vote will be open for at least 72 hours. I'll try to close it
> > by
> > >>>>>>> 2019-10-16 18:00 UTC, unless there is an objection or not enough
> > >>>> votes.
> > >>>>>>>
> > >>>>>>> Thanks,
> > >>>>>>> Wei
> > >>>>>>>
> > >>>>>>> [1]
> > >>>>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-78%3A+Flink+Python+UDF+Environment+and+Dependency+Management
> > >>>>>>> [2]
> > >>>>>>> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-Flink-Python-UDF-Environment-and-Dependency-Management-td33514.html
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Best Regards
> > >>>
> > >>> Jeff Zhang
> > >>>
> > >
> >
>
