The FnApiRunner is primarily for tiny jobs (development and testing) and holds all the data in memory. You'll likely have to run with a "real" runner to operate over datasets of this size. If you want to run locally, you can pass --runner=FlinkRunner and (assuming you have Java installed) it will run your job on a local Flink cluster which might be a good option if you want to keep things on a single machine.
On Mon, Mar 13, 2023 at 11:51 AM [email protected] <[email protected]> wrote: > > Let me provide more details. > > We are running TFX and we specified beam FnApiRunner as the underlying runner > type. > > Our dataset is a large amount of HDFS files, each around 200MB and the total > are around 200GB. > > When running our TFX code, we saw OOM issue. I assume this is due to Beam > FnApiRunner loading all the data while executing each stage one by one. > > Regards > > ------------------------------------------------------------- > > Wilson(Xiaoshuang) Wang > Sr. Software Engineer > > > On Mon, Mar 13, 2023 at 11:32 AM [email protected] > <[email protected]> wrote: >> >> Python Beam direct runner. >> >> >> Regards >> >> ------------------------------------------------------------- >> >> Wilson(Xiaoshuang) Wang >> Sr. Software Engineer >> >> >> On Mon, Mar 13, 2023 at 11:29 AM Robert Burke <[email protected]> wrote: >>> >>> Which direct runner? They are language specific. >>> >>> On Mon, Mar 13, 2023, 11:27 AM [email protected] >>> <[email protected]> wrote: >>>> >>>> Hi guys, >>>> >>>> We are trying to run our pipeline using direct runner and the input >>>> dataset is a large amount of HDFS files (few hundred of GB data) >>>> >>>> We experienced OOM issue crash. Then inside the direct runner document, I >>>> realized direct runner loads the whole dataset into the memory. >>>> >>>> Is there any way we can avoid this OOM issue? >>>> >>>> Regards >>>> >>>> ------------------------------------------------------------- >>>> >>>> Wilson(Xiaoshuang) Wang >>>> Sr. Software Engineer
