On Mon, Mar 13, 2023 at 12:32 PM wilsonny...@gmail.com
<wilsonny...@gmail.com> wrote:
>
> Thank you Robert.
>
> Yeah, I agree using flink can work in this case. However, in our environment, 
> we are using ray beam runner which will take advantage of Ray framework.
>
> Internally, ray beam runner is using FnApiRunner to run stages and bundles. 
> After some code walkthrough, I am wondering how FnApiRunner is loading all 
> data into the memory in latest code.
>
> In the old FnApiRunner code, it will do one iteration of different stages and 
> I can understand this will load all data into memory because the first stage 
> needs to be finished before the second stage executes. In the latest 
> FnApiRunner code, it seems like it is now using a different approach and it 
> is no longer executing stage by stage. So my question is, in the latest code, 
> will it still be true that all data will be loaded into the memory?

There was an (incomplete) refactor to be more amenable to streaming,
but the execution for batch is still, essentially, stage-by-stage
keeping all the intermediate data in memory.

> Regards
>
> -------------------------------------------------------------
>
> Wilson(Xiaoshuang) Wang
> Sr. Software Engineer
>
>
> On Mon, Mar 13, 2023 at 12:13 PM Robert Bradshaw via dev 
> <dev@beam.apache.org> wrote:
>>
>> The FnApiRunner is primarily for tiny jobs (development and testing)
>> and holds all the data in memory. You'll likely have to run with a
>> "real" runner to operate over datasets of this size. If you want to
>> run locally, you can pass --runner=FlinkRunner and (assuming you have
>> Java installed) it will run your job on a local Flink cluster which
>> might be a good option if you want to keep things on a single machine.
>>
>> On Mon, Mar 13, 2023 at 11:51 AM wilsonny...@gmail.com
>> <wilsonny...@gmail.com> wrote:
>> >
>> > Let me provide more details.
>> >
>> > We are running TFX and we specified beam FnApiRunner as the underlying 
>> > runner type.
>> >
>> > Our dataset is a large amount of HDFS files, each around 200MB and the 
>> > total are around 200GB.
>> >
>> > When running our TFX code, we saw OOM issue.  I assume this is due to Beam 
>> > FnApiRunner loading all the data while executing each stage one by one.
>> >
>> > Regards
>> >
>> > -------------------------------------------------------------
>> >
>> > Wilson(Xiaoshuang) Wang
>> > Sr. Software Engineer
>> >
>> >
>> > On Mon, Mar 13, 2023 at 11:32 AM wilsonny...@gmail.com 
>> > <wilsonny...@gmail.com> wrote:
>> >>
>> >> Python Beam direct runner.
>> >>
>> >>
>> >> Regards
>> >>
>> >> -------------------------------------------------------------
>> >>
>> >> Wilson(Xiaoshuang) Wang
>> >> Sr. Software Engineer
>> >>
>> >>
>> >> On Mon, Mar 13, 2023 at 11:29 AM Robert Burke <rob...@frantil.com> wrote:
>> >>>
>> >>> Which direct runner? They are language specific.
>> >>>
>> >>> On Mon, Mar 13, 2023, 11:27 AM wilsonny...@gmail.com 
>> >>> <wilsonny...@gmail.com> wrote:
>> >>>>
>> >>>> Hi guys,
>> >>>>
>> >>>> We are trying to run our pipeline using direct runner and the input 
>> >>>> dataset is a large amount of HDFS files (few hundred of GB data)
>> >>>>
>> >>>> We experienced OOM issue crash. Then inside the direct runner document, 
>> >>>> I realized direct runner loads the whole dataset into the memory.
>> >>>>
>> >>>> Is there any way we can avoid this OOM issue?
>> >>>>
>> >>>> Regards
>> >>>>
>> >>>> -------------------------------------------------------------
>> >>>>
>> >>>> Wilson(Xiaoshuang) Wang
>> >>>> Sr. Software Engineer

Reply via email to