Thank you, Robert.

Yes, I agree that using Flink can work in this case. However, in our
environment we are using the Ray Beam runner, which takes advantage of the
Ray framework.

Internally, the Ray Beam runner uses FnApiRunner to run stages and bundles.
After walking through the code, I am wondering how FnApiRunner loads all
the data into memory in the latest code.

In the old FnApiRunner code, it iterates over the stages once, in order,
and I can understand why this loads all data into memory: the first
stage needs to finish before the second stage executes. The latest
FnApiRunner code seems to use a different approach and no longer
executes stage by stage. So my question is: with the latest code, is it
still true that all data will be loaded into memory?
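
To make the question concrete, here is a toy sketch in plain Python of the
two execution styles I have in mind. It is not Beam code and does not
reflect FnApiRunner's actual implementation; all function names are
hypothetical. The first function finishes each stage before the next one
starts, so every full intermediate result sits in memory. The second pushes
one small bundle at a time through all stages, so only one bundle is held
at once. It assumes the stages are simple element-wise steps; a grouping
step would presumably still need to buffer its input.

```python
from typing import Callable, Iterable, Iterator, List

# A "stage" in this toy model is just an element-wise function.
Stage = Callable[[int], int]


def run_stage_by_stage(source: Iterable[int], stages: List[Stage]) -> List[int]:
    """Finish each stage completely before starting the next one."""
    data = list(source)  # the entire input is held in memory
    for stage in stages:
        # The full intermediate result of every stage is materialized.
        data = [stage(x) for x in data]
    return data


def _apply(bundle: List[int], stages: List[Stage]) -> List[int]:
    """Run a single bundle through every stage in order."""
    for stage in stages:
        bundle = [stage(x) for x in bundle]
    return bundle


def run_in_bundles(
    source: Iterable[int], stages: List[Stage], bundle_size: int
) -> Iterator[List[int]]:
    """Push one bundle at a time through all stages, yielding each result."""
    bundle: List[int] = []
    for element in source:
        bundle.append(element)
        if len(bundle) == bundle_size:
            yield _apply(bundle, stages)
            bundle = []  # the previous bundle can now be garbage collected
    if bundle:
        yield _apply(bundle, stages)  # final, possibly smaller, bundle


if __name__ == "__main__":
    stages = [lambda x: x + 1, lambda x: x * 2]
    print(run_stage_by_stage(range(5), stages))       # [2, 4, 6, 8, 10]
    print(list(run_in_bundles(range(5), stages, 2)))  # [[2, 4], [6, 8], [10]]
```

Both styles produce the same elements; the difference is only how much data
is alive at any moment. What I would like to understand is which of the two
the latest FnApiRunner code is closer to.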



Regards

-------------------------------------------------------------

Wilson(Xiaoshuang) Wang
Sr. Software Engineer


On Mon, Mar 13, 2023 at 12:13 PM Robert Bradshaw via dev <
[email protected]> wrote:

> The FnApiRunner is primarily for tiny jobs (development and testing)
> and holds all the data in memory. You'll likely have to run with a
> "real" runner to operate over datasets of this size. If you want to
> run locally, you can pass --runner=FlinkRunner and (assuming you have
> Java installed) it will run your job on a local Flink cluster which
> might be a good option if you want to keep things on a single machine.
>
> On Mon, Mar 13, 2023 at 11:51 AM [email protected]
> <[email protected]> wrote:
> >
> > Let me provide more details.
> >
> > We are running TFX and we specified the Beam FnApiRunner as the underlying runner type.
> >
> > Our dataset is a large number of HDFS files, each around 200MB, about 200GB in total.
> >
> > When running our TFX code, we saw an OOM issue. I assume this is due to the Beam FnApiRunner loading all the data while executing each stage one by one.
> >
> > Regards
> >
> > -------------------------------------------------------------
> >
> > Wilson(Xiaoshuang) Wang
> > Sr. Software Engineer
> >
> >
> > On Mon, Mar 13, 2023 at 11:32 AM [email protected] <[email protected]> wrote:
> >>
> >> Python Beam direct runner.
> >>
> >>
> >> Regards
> >>
> >> -------------------------------------------------------------
> >>
> >> Wilson(Xiaoshuang) Wang
> >> Sr. Software Engineer
> >>
> >>
> >> On Mon, Mar 13, 2023 at 11:29 AM Robert Burke <[email protected]> wrote:
> >>>
> >>> Which direct runner? They are language specific.
> >>>
> >>> On Mon, Mar 13, 2023, 11:27 AM [email protected] <[email protected]> wrote:
> >>>>
> >>>> Hi guys,
> >>>>
> >>>> We are trying to run our pipeline using the direct runner, and the input dataset is a large number of HDFS files (a few hundred GB of data).
> >>>>
> >>>> We hit an OOM crash. Then, from the direct runner documentation, I realized the direct runner loads the whole dataset into memory.
> >>>>
> >>>> Is there any way we can avoid this OOM issue?
> >>>>
> >>>> Regards
> >>>>
> >>>> -------------------------------------------------------------
> >>>>
> >>>> Wilson(Xiaoshuang) Wang
> >>>> Sr. Software Engineer
>
