[ https://issues.apache.org/jira/browse/BEAM-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16363576#comment-16363576 ]

yuri krnr commented on BEAM-3622:
---------------------------------

Cool! I'll be looking forward to the results.

> DirectRunner memory issue with Python SDK
> -----------------------------------------
>
>                 Key: BEAM-3622
>                 URL: https://issues.apache.org/jira/browse/BEAM-3622
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: yuri krnr
>            Assignee: Charles Chen
>            Priority: Major
>
> After running the pipeline for a while in streaming mode (reading from Pub/Sub 
> and writing to BigQuery, Datastore and another Pub/Sub), I noticed a drastic 
> increase in the process's memory usage. Using guppy as a profiler, I got the 
> following results:
> At start:
> {noformat}
>  INFO *** MemoryReport Heap:
>  Partition of a set of 240208 objects. Total size = 34988840 bytes.
>  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
>      0  88289  37  8696984  25   8696984  25 str
>      1  53333  22  4897352  14  13594336  39 tuple
>      2   5083   2  2790664   8  16385000  47 dict (no owner)
>      3   1939   1  1749656   5  18134656  52 type
>      4    699   0  1723272   5  19857928  57 dict of module
>      5  12337   5  1579136   5  21437064  61 types.CodeType
>      6  12403   5  1488360   4  22925424  66 function
>      7   1939   1  1452616   4  24378040  70 dict of type
>      8    677   0   709496   2  25087536  72 dict of 0x1e4d880
>      9  25603  11   614472   2  25702008  73 int
> <1103 more rows. Type e.g. '_.more' to view.>
> {noformat}
> After several hours of running:
> {noformat}
> INFO *** MemoryReport Heap:
>  Partition of a set of 1255662 objects. Total size = 315029632 bytes.
>  Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
>      0  95554   8 99755056  32  99755056  32 dict of apache_beam.runners.direct.bundle_factory._Bundle
>      1 117943   9 54193192  17 153948248  49 dict (no owner)
>      2 161068  13 27169296   9 181117544  57 unicode
>      3  94571   8 26479880   8 207597424  66 dict of apache_beam.pvalue.PBegin
>      4 126461  10 12715336   4 220312760  70 str
>      5  44374   4 12424720   4 232737480  74 dict of apitools.base.protorpclite.messages.FieldList
>      6  44374   4  6348624   2 239086104  76 apitools.base.protorpclite.messages.FieldList
>      7  95556   8  6115584   2 245201688  78 apache_beam.runners.direct.bundle_factory._Bundle
>      8  94571   8  6052544   2 251254232  80 apache_beam.pvalue.PBegin
>      9  57371   5  5218424   2 256472656  81 tuple
> <1187 more rows. Type e.g. '_.more' to view.>
> {noformat}
>  
> I see that every bundle, along with all of its data, still sits in memory. Why 
> aren't they garbage-collected?
> What is the garbage-collection policy for the Dataflow processes?
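For anyone trying to reproduce the heap snapshots above without installing guppy, a rough equivalent can be built from the standard library's `gc` module alone (a minimal sketch, not the profiler used in the report; note that `gc.get_objects()` only sees GC-tracked container types, so atomic objects such as small ints are undercounted compared to guppy):

```python
import gc
from collections import Counter

# Tally live, GC-tracked objects by type name -- a rough analogue of the
# guppy partition tables above. Calling this periodically from the pipeline
# process and diffing the counters highlights which types keep growing
# (e.g. the _Bundle and PBegin instances seen in the second snapshot).
counts = Counter(type(obj).__name__ for obj in gc.get_objects())
for name, n in counts.most_common(10):
    print(f"{n:8d}  {name}")
```

Comparing two such snapshots taken hours apart should show whether the `_Bundle` count is genuinely monotonically increasing or whether the objects are eventually collected.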



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
