Thank you, Robert, for the clarification.


Regards

-------------------------------------------------------------

Wilson(Xiaoshuang) Wang
Sr. Software Engineer


On Mon, Mar 13, 2023 at 1:34 PM Robert Bradshaw <rober...@google.com> wrote:

> On Mon, Mar 13, 2023 at 12:32 PM wilsonny...@gmail.com
> <wilsonny...@gmail.com> wrote:
> >
> > Thank you Robert.
> >
> > Yeah, I agree that using Flink can work in this case. However, in our
> > environment we are using the Ray Beam runner, which takes advantage of
> > the Ray framework.
> >
> > Internally, the Ray Beam runner uses FnApiRunner to run stages and
> > bundles. After walking through the code, I am wondering how FnApiRunner
> > loads all the data into memory in the latest code.
> >
> > The old FnApiRunner code makes one pass over the stages in order, and I
> > can understand why this loads all the data into memory: the first stage
> > must finish before the second stage executes. The latest FnApiRunner
> > code seems to use a different approach and no longer executes stage by
> > stage. So my question is: in the latest code, is it still true that all
> > the data is loaded into memory?
>
> There was an (incomplete) refactor to be more amenable to streaming,
> but the execution for batch is still, essentially, stage-by-stage
> keeping all the intermediate data in memory.
>
> > Regards
> >
> > -------------------------------------------------------------
> >
> > Wilson(Xiaoshuang) Wang
> > Sr. Software Engineer
> >
> >
> > On Mon, Mar 13, 2023 at 12:13 PM Robert Bradshaw via dev <dev@beam.apache.org> wrote:
> >>
> >> The FnApiRunner is primarily for tiny jobs (development and testing)
> >> and holds all the data in memory. You'll likely have to run with a
> >> "real" runner to operate over datasets of this size. If you want to
> >> run locally, you can pass --runner=FlinkRunner and (assuming you have
> >> Java installed) it will run your job on a local Flink cluster which
> >> might be a good option if you want to keep things on a single machine.
> >>
> >> On Mon, Mar 13, 2023 at 11:51 AM wilsonny...@gmail.com
> >> <wilsonny...@gmail.com> wrote:
> >> >
> >> > Let me provide more details.
> >> >
> >> > We are running TFX, and we specified Beam's FnApiRunner as the
> >> > underlying runner type.
> >> >
> >> > Our dataset is a large number of HDFS files, each around 200 MB, for
> >> > a total of around 200 GB.
> >> >
> >> > When running our TFX code, we hit an OOM error. I assume this is
> >> > because Beam's FnApiRunner loads all the data into memory while
> >> > executing the stages one by one.
> >> >
> >> > Regards
> >> >
> >> > -------------------------------------------------------------
> >> >
> >> > Wilson(Xiaoshuang) Wang
> >> > Sr. Software Engineer
> >> >
> >> >
> >> > On Mon, Mar 13, 2023 at 11:32 AM wilsonny...@gmail.com <wilsonny...@gmail.com> wrote:
> >> >>
> >> >> Python Beam direct runner.
> >> >>
> >> >>
> >> >> Regards
> >> >>
> >> >> -------------------------------------------------------------
> >> >>
> >> >> Wilson(Xiaoshuang) Wang
> >> >> Sr. Software Engineer
> >> >>
> >> >>
> >> >> On Mon, Mar 13, 2023 at 11:29 AM Robert Burke <rob...@frantil.com> wrote:
> >> >>>
> >> >>> Which direct runner? They are language specific.
> >> >>>
> >> >>> On Mon, Mar 13, 2023, 11:27 AM wilsonny...@gmail.com <wilsonny...@gmail.com> wrote:
> >> >>>>
> >> >>>> Hi guys,
> >> >>>>
> >> >>>> We are trying to run our pipeline using the direct runner, and the
> >> >>>> input dataset is a large number of HDFS files (a few hundred GB of data).
> >> >>>>
> >> >>>> The pipeline crashed with an OOM error. From the direct runner
> >> >>>> documentation, I then realized that the direct runner loads the whole
> >> >>>> dataset into memory.
> >> >>>>
> >> >>>> Is there any way we can avoid this OOM issue?
> >> >>>>
> >> >>>> Regards
> >> >>>>
> >> >>>> -------------------------------------------------------------
> >> >>>>
> >> >>>> Wilson(Xiaoshuang) Wang
> >> >>>> Sr. Software Engineer
>
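
For readers of this thread, the memory behavior described above (batch execution that runs stage by stage and keeps all intermediate data in memory) can be pictured with a toy Python sketch. This is not Beam or FnApiRunner code: the functions run_stage_by_stage and run_streaming and the Stage type are hypothetical names invented for this illustration, and it assumes every stage is a simple element-wise function.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical stand-in for a pipeline stage: a plain element-wise function.
Stage = Callable[[int], int]


def run_stage_by_stage(source: Iterable[int],
                       stages: List[Stage]) -> Tuple[List[int], int]:
    """Finish each stage before starting the next one.

    Returns the final output and the size of the largest collection
    that had to be fully materialized along the way.
    """
    data = list(source)  # the whole input is held in memory
    peak = len(data)
    for stage in stages:
        data = [stage(x) for x in data]  # full intermediate result in memory
        peak = max(peak, len(data))
    return data, peak


def run_streaming(source: Iterable[int], stages: List[Stage]) -> Iterator[int]:
    """Push one element at a time through all stages; nothing is materialized."""
    for x in source:
        for stage in stages:
            x = stage(x)
        yield x


if __name__ == "__main__":
    stages = [lambda x: x + 1, lambda x: x * 2]
    result, peak = run_stage_by_stage(range(5), stages)
    print(result, peak)
    print(list(run_streaming(range(5), stages)))
```

In the first function, memory use grows with the size of the dataset, which matches the OOM symptom reported with a few hundred GB of input. The second function only works this simply because the toy stages are element-wise; a stage that groups elements by key needs more machinery, which a distributed runner provides.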
