If you're working with Dataflow, it supports this flag, which uses guppy for heap profiling: https://github.com/apache/beam/blob/75e9f645c7bec940b87b93f416823b020e4c5f69/sdks/python/apache_beam/options/pipeline_options.py#L602
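For reference, a minimal sketch of how that option could be passed, assuming the linked flag is --profile_memory from ProfilingOptions and that --profile_location plus the project/bucket values (placeholders here, not from this thread) are set for your environment:

    # Sketch only: enable heap profiling on a Dataflow pipeline.
    # Assumes --profile_memory / --profile_location from ProfilingOptions;
    # project and bucket names below are hypothetical placeholders.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions([
        '--runner=DataflowRunner',
        '--project=my-project',                        # placeholder project id
        '--temp_location=gs://my-bucket/tmp',          # placeholder bucket
        '--profile_memory',                            # guppy-based heap profiling
        '--profile_location=gs://my-bucket/profiles',  # where profiles get written
    ])

    with beam.Pipeline(options=options) as p:
        (p
         | beam.Create([1, 2, 3])
         | beam.Map(lambda x: x * x))

The written profiles can then be inspected offline to see which object types dominate the heap on the workers.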
On Fri, Nov 16, 2018 at 3:08 PM Ruoyun Huang <[email protected]> wrote:

> Even though the algorithm works on your batch system, did you verify
> anything that can rule out the possibility that it is the underlying ML
> package causing the memory leak?
>
> If not, maybe replace your prediction with a dummy function that does not
> load any model at all and always returns the same prediction, then do the
> same plotting and let us see what it looks like. Then a second version:
> still a dummy prediction, but with the model loaded. Given we don't have
> much clue at this stage, this should at least give us more confidence in
> whether the issue comes from the underlying ML package or from the Beam
> SDK. Just my 2 cents.
>
> On Thu, Nov 15, 2018 at 4:54 PM Rakesh Kumar <[email protected]> wrote:
>
>> Thanks for responding, Ruoyun.
>>
>> We are not sure yet what is causing the leak, but once we run out of
>> memory the SDK worker crashes and the pipeline is forced to restart.
>> Check the memory usage patterns in the attached image; each line in that
>> graph represents one task manager host. You are right that we are
>> running the models for prediction.
>>
>> Here are a few observations:
>>
>> 1. All the task managers' memory usage climbs over time, but some of the
>> task managers' memory climbs really fast because they are running the ML
>> models. These models definitely use memory-intensive data structures
>> (pandas data frames etc.), hence their memory usage climbs really fast.
>> 2. We have almost the same code running on different (non-streaming)
>> infrastructure and it doesn't cause any memory issues.
>> 3. Even when the pipeline has restarted, the memory is not released; it
>> is still held by something. You can see in the attached image that the
>> pipeline restarted around 13:30. At that time it definitely released
>> some portion of the memory but didn't release all of it. Notice that
>> when the pipeline was originally started, it started at 30% memory
>> usage, but when it was restarted by the job manager it started at 60%.
>>
>> On Thu, Nov 15, 2018 at 3:31 PM Ruoyun Huang <[email protected]> wrote:
>>
>>> Trying to understand the situation you are having.
>>>
>>> By saying 'kills the application', is the leak in the application
>>> itself, or are the workers the root cause? Also, are you running ML
>>> models inside Python SDK DoFns? Then I suppose it is running
>>> predictions rather than model training?
>>>
>>> On Thu, Nov 15, 2018 at 1:08 PM Rakesh Kumar <[email protected]>
>>> wrote:
>>>
>>>> I am using the Beam Python SDK to run my app in production. The app is
>>>> running machine learning models. I am noticing a memory leak which
>>>> eventually kills the application, and I am not sure of its source.
>>>> Currently, I am using objgraph
>>>> <https://mg.pov.lt/objgraph/#memory-leak-example> to dump the memory
>>>> stats, and I hope I will get some useful information out of it. I have
>>>> also looked into the Guppy library <https://pypi.org/project/guppy/>,
>>>> and the two are almost the same.
>>>>
>>>> Do you have any recommendations for debugging this issue? Do we have
>>>> any tooling in the SDK that can help debug it?
>>>> Please feel free to share your experience if you have debugged similar
>>>> issues in the past.
>>>>
>>>> Thank you,
>>>> Rakesh
>>>>
>>>
>>> --
>>> ================
>>> Ruoyun Huang
>>>
>
> --
> ================
> Ruoyun Huang
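To make the isolation experiment above concrete, here is a minimal sketch (illustrative only, not from the original pipeline) of a DoFn that can run in "dummy prediction" mode versus "real model" mode, with periodic objgraph growth dumps so the two runs can be compared. PredictDoFn, the log_every threshold, and the _load_model placeholder are all hypothetical names:

    # Sketch of the dummy-vs-real comparison with objgraph instrumentation.
    # All names here are illustrative; plug in the real model loading and
    # element types from the actual pipeline.
    import logging

    import apache_beam as beam
    import objgraph


    class PredictDoFn(beam.DoFn):
        """Prediction DoFn with a dummy mode for isolating the leak."""

        def __init__(self, dummy=True, log_every=10000):
            self.dummy = dummy          # True: skip the ML package entirely
            self.log_every = log_every  # how often to dump objgraph growth
            self.model = None
            self.seen = 0

        def process(self, element):
            self.seen += 1
            if self.seen % self.log_every == 0:
                logging.info('objgraph growth after %d elements:', self.seen)
                # Prints object types whose counts grew since the last call
                # (to stdout, which most runners capture in worker logs); a
                # type that keeps growing is a leak suspect.
                objgraph.show_growth(limit=10)
            if self.dummy:
                yield 0.0  # constant prediction, no model involved
            else:
                if self.model is None:
                    self.model = self._load_model()
                yield self.model.predict(element)

        def _load_model(self):
            # Hypothetical placeholder for the real model loading.
            raise NotImplementedError('plug in the real model here')

Running the same input once with dummy=True and once with the real model, and comparing both the memory graphs and the objgraph growth logs, should show whether the growth comes from the ML package or from somewhere else in the pipeline.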
