Thank you Robert for the clarification.
Regards ------------------------------------------------------------- Wilson(Xiaoshuang) Wang Sr. Software Engineer On Mon, Mar 13, 2023 at 1:34 PM Robert Bradshaw <rober...@google.com> wrote: > On Mon, Mar 13, 2023 at 12:32 PM wilsonny...@gmail.com > <wilsonny...@gmail.com> wrote: > > > > Thank you Robert. > > > > Yeah, I agree using flink can work in this case. However, in our > environment, we are using ray beam runner which will take advantage of Ray > framework. > > > > Internally, ray beam runner is using FnApiRunner to run stages and > bundles. After some code walkthrough, I am wondering how FnApiRunner is > loading all data into the memory in latest code. > > > > In the old FnApiRunner code, it will do one iteration of different > stages and I can understand this will load all data into memory because the > first stage needs to be finished before the second stage executes. In the > latest FnApiRunner code, it seems like it is now using a different approach > and it is no longer executing stage by stage. So my question is, in the > latest code, will it still be true that all data will be loaded into the > memory? > > There was an (incomplete) refactor to be more amenable to streaming, > but the execution for batch is still, essentially, stage-by-stage > keeping all the intermediate data in memory. > > > Regards > > > > ------------------------------------------------------------- > > > > Wilson(Xiaoshuang) Wang > > Sr. Software Engineer > > > > > > On Mon, Mar 13, 2023 at 12:13 PM Robert Bradshaw via dev < > dev@beam.apache.org> wrote: > >> > >> The FnApiRunner is primarily for tiny jobs (development and testing) > >> and holds all the data in memory. You'll likely have to run with a > >> "real" runner to operate over datasets of this size. If you want to > >> run locally, you can pass --runner=FlinkRunner and (assuming you have > >> Java installed) it will run your job on a local Flink cluster which > >> might be a good option if you want to keep things on a single machine. > >> > >> On Mon, Mar 13, 2023 at 11:51 AM wilsonny...@gmail.com > >> <wilsonny...@gmail.com> wrote: > >> > > >> > Let me provide more details. > >> > > >> > We are running TFX and we specified beam FnApiRunner as the > underlying runner type. > >> > > >> > Our dataset is a large amount of HDFS files, each around 200MB and > the total are around 200GB. > >> > > >> > When running our TFX code, we saw OOM issue. I assume this is due to > Beam FnApiRunner loading all the data while executing each stage one by one. > >> > > >> > Regards > >> > > >> > ------------------------------------------------------------- > >> > > >> > Wilson(Xiaoshuang) Wang > >> > Sr. Software Engineer > >> > > >> > > >> > On Mon, Mar 13, 2023 at 11:32 AM wilsonny...@gmail.com < > wilsonny...@gmail.com> wrote: > >> >> > >> >> Python Beam direct runner. > >> >> > >> >> > >> >> Regards > >> >> > >> >> ------------------------------------------------------------- > >> >> > >> >> Wilson(Xiaoshuang) Wang > >> >> Sr. Software Engineer > >> >> > >> >> > >> >> On Mon, Mar 13, 2023 at 11:29 AM Robert Burke <rob...@frantil.com> > wrote: > >> >>> > >> >>> Which direct runner? They are language specific. > >> >>> > >> >>> On Mon, Mar 13, 2023, 11:27 AM wilsonny...@gmail.com < > wilsonny...@gmail.com> wrote: > >> >>>> > >> >>>> Hi guys, > >> >>>> > >> >>>> We are trying to run our pipeline using direct runner and the > input dataset is a large amount of HDFS files (few hundred of GB data) > >> >>>> > >> >>>> We experienced OOM issue crash. Then inside the direct runner > document, I realized direct runner loads the whole dataset into the memory. > >> >>>> > >> >>>> Is there any way we can avoid this OOM issue? > >> >>>> > >> >>>> Regards > >> >>>> > >> >>>> ------------------------------------------------------------- > >> >>>> > >> >>>> Wilson(Xiaoshuang) Wang > >> >>>> Sr. Software Engineer >