On Fri, Nov 16, 2018 at 3:36 PM Udi Meiri <[email protected]> wrote:

> If you're working with Dataflow, it supports this flag:
> https://github.com/apache/beam/blob/75e9f645c7bec940b87b93f416823b020e4c5f69/sdks/python/apache_beam/options/pipeline_options.py#L602
> which uses guppy for heap profiling.
>

This is a really useful flag. Unfortunately, we are using Beam + Flink. It
would be really useful to have a similar flag for the other streaming engines.
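
In the meantime, here is a rough sketch of what we might try by hand on
Flink, assuming guppy (guppy3 on Python 3) is installed on the SDK workers:
periodically log a heap summary from inside a DoFn. The DoFn name and the
sampling interval are just placeholders.

import logging

import apache_beam as beam


class HeapSummaryDoFn(beam.DoFn):
    """Illustrative only: log a guppy heap summary every N elements."""

    def __init__(self, every_n=100000):
        self._every_n = every_n
        self._count = 0
        self._hpy = None

    def process(self, element):
        if self._hpy is None:
            # Import lazily so the module still loads where guppy is absent.
            from guppy import hpy
            self._hpy = hpy()
        self._count += 1
        if self._count % self._every_n == 0:
            logging.info('Heap summary:\n%s', self._hpy.heap())
        yield element

It is not as convenient as the Dataflow flag, but it should produce the same
kind of heap snapshots in the worker logs.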


> On Fri, Nov 16, 2018 at 3:08 PM Ruoyun Huang <[email protected]> wrote:
>
>> Even though the algorithm works on your batch system, have you verified
>> anything that would rule out the possibility that the underlying ML
>> package is causing the memory leak?
>>
>> If not, maybe replace your prediction with a dummy function that does not
>> load any model at all and always returns the same prediction. Then do the
>> same plotting and let us see what it looks like. As a follow-up, try
>> version two: still a dummy prediction, but with the model loaded. Given
>> that we don't have much of a clue at this stage, this should at least give
>> us more confidence about whether the issue comes from the underlying ML
>> package or from the Beam SDK. Just my 2 cents.
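>>
>> Something along these lines is what I have in mind (just a sketch; the
>> class names and load_model() are placeholders, not from your pipeline):
>>
>> import apache_beam as beam
>>
>> class ConstantPredictionDoFn(beam.DoFn):
>>     """Version one: no model at all, always the same prediction."""
>>     def process(self, element):
>>         yield (element, 0.0)
>>
>> class ModelLoadedConstantDoFn(beam.DoFn):
>>     """Version two: load the model, but still return a constant."""
>>     def start_bundle(self):
>>         if not hasattr(self, '_model'):
>>             self._model = load_model()  # placeholder for your model loading
>>     def process(self, element):
>>         yield (element, 0.0)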
>>
>>
>> On Thu, Nov 15, 2018 at 4:54 PM Rakesh Kumar <[email protected]>
>> wrote:
>>
>>> Thanks for responding, Ruoyun.
>>>
>>> We are not sure yet what is causing the leak, but once we run out of
>>> memory the SDK worker crashes and the pipeline is forced to restart. Check
>>> the memory usage patterns in the attached image; each line in that graph
>>> represents one task manager host.
>>> You are right, we are running the models for predictions.
>>>
>>> Here are a few observations:
>>>
>>> 1. Memory usage on all the task managers climbs over time, but on some of
>>> them it climbs really fast because they are running the ML models. These
>>> models use memory-intensive data structures (pandas data frames, etc.),
>>> hence their memory usage climbs really fast.
>>> 2. We have almost the same code running on different (non-streaming)
>>> infrastructure, and there it doesn't cause any memory issue.
>>> 3. Even after the pipeline restarts, the memory is not fully released; it
>>> is still held by something. You can see in the attached image that the
>>> pipeline restarted around 13:30. At that point some portion of the memory
>>> was released, but not all of it. Notice that when the pipeline originally
>>> started it used about 30% of the memory, but when it was restarted by the
>>> job manager it started at about 60%.
>>>
>>>
>>>
>>> On Thu, Nov 15, 2018 at 3:31 PM Ruoyun Huang <[email protected]> wrote:
>>>
>>>> Trying to understand the situation you are having.
>>>>
>>>> By saying 'kills the application', do you mean the leak is in the
>>>> application itself, or are the workers the root cause? Also, are you
>>>> running ML models inside Python SDK DoFns? Then I suppose it is running
>>>> predictions rather than model training?
>>>>
>>>> On Thu, Nov 15, 2018 at 1:08 PM Rakesh Kumar <[email protected]>
>>>> wrote:
>>>>
>>>>> I am using the *Beam Python SDK* to run my app in production. The app
>>>>> runs machine learning models. I am noticing a memory leak which
>>>>> eventually kills the application, and I am not sure of its source.
>>>>> Currently, I am using objgraph
>>>>> <https://mg.pov.lt/objgraph/#memory-leak-example> to dump memory stats,
>>>>> and I hope to get some useful information out of it. I have also looked
>>>>> into the Guppy library <https://pypi.org/project/guppy/>; the two are
>>>>> quite similar.
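>>>>>
>>>>> For reference, this is roughly how I am calling it (a sketch; the limit
>>>>> and where it gets called from are arbitrary):
>>>>>
>>>>> import gc
>>>>> import logging
>>>>>
>>>>> import objgraph
>>>>>
>>>>> def log_memory_stats(limit=20):
>>>>>     gc.collect()
>>>>>     # Top object types by instance count.
>>>>>     logging.info('Most common types: %s',
>>>>>                  objgraph.most_common_types(limit=limit))
>>>>>     # Growth in object counts since the previous call (printed to stdout).
>>>>>     objgraph.show_growth(limit=limit)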
>>>>>
>>>>> Do you have any recommendations for debugging this issue? Is there any
>>>>> tooling in the SDK that can help debug it?
>>>>> Please feel free to share your experience if you have debugged similar
>>>>> issues in the past.
>>>>>
>>>>> Thank you,
>>>>> Rakesh
>>>>>
>>>>
>>>>
>>>> --
>>>> ================
>>>> Ruoyun  Huang
>>>>
>>>>
>>
>> --
>> ================
>> Ruoyun  Huang
>>
>>
