yuri krnr created BEAM-3622:
-------------------------------

             Summary: DirectRunner memory issue with Python SDK
                 Key: BEAM-3622
                 URL: https://issues.apache.org/jira/browse/BEAM-3622
             Project: Beam
          Issue Type: Bug
          Components: runner-direct, sdk-py-core
            Reporter: yuri krnr
            Assignee: Thomas Groh


After running a pipeline for a while in streaming mode (reading from Pub/Sub 
and writing to BigQuery, Datastore and another Pub/Sub topic), I noticed a 
drastic increase in the memory usage of the process. Using guppy as a 
profiler, I got the following results:

At start:
{noformat}
 INFO *** MemoryReport Heap:
 Partition of a set of 240208 objects. Total size = 34988840 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  88289  37  8696984  25   8696984  25 str
     1  53333  22  4897352  14  13594336  39 tuple
     2   5083   2  2790664   8  16385000  47 dict (no owner)
     3   1939   1  1749656   5  18134656  52 type
     4    699   0  1723272   5  19857928  57 dict of module
     5  12337   5  1579136   5  21437064  61 types.CodeType
     6  12403   5  1488360   4  22925424  66 function
     7   1939   1  1452616   4  24378040  70 dict of type
     8    677   0   709496   2  25087536  72 dict of 0x1e4d880
     9  25603  11   614472   2  25702008  73 int
<1103 more rows. Type e.g. '_.more' to view.>
{noformat}
After several hours of running:
{noformat}
INFO *** MemoryReport Heap:
 Partition of a set of 1255662 objects. Total size = 315029632 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  95554   8 99755056  32  99755056  32 dict of apache_beam.runners.direct.bundle_factory._Bundle
     1 117943   9 54193192  17 153948248  49 dict (no owner)
     2 161068  13 27169296   9 181117544  57 unicode
     3  94571   8 26479880   8 207597424  66 dict of apache_beam.pvalue.PBegin
     4 126461  10 12715336   4 220312760  70 str
     5  44374   4 12424720   4 232737480  74 dict of apitools.base.protorpclite.messages.FieldList
     6  44374   4  6348624   2 239086104  76 apitools.base.protorpclite.messages.FieldList
     7  95556   8  6115584   2 245201688  78 apache_beam.runners.direct.bundle_factory._Bundle
     8  94571   8  6052544   2 251254232  80 apache_beam.pvalue.PBegin
     9  57371   5  5218424   2 256472656  81 tuple
<1187 more rows. Type e.g. '_.more' to view.>
{noformat}
I see that every bundle, along with all of its data, still sits in memory. 
Why aren't they garbage-collected?

What is the garbage collection policy for the Dataflow processes?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
