[ https://issues.apache.org/jira/browse/BEAM-3622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Bradshaw updated BEAM-3622:
----------------------------------
Component/s: sdk-py-harness
> DirectRunner memory issue with Python SDK
> -----------------------------------------
>
> Key: BEAM-3622
> URL: https://issues.apache.org/jira/browse/BEAM-3622
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core, sdk-py-harness
> Reporter: yuri krnr
> Assignee: Charles Chen
> Priority: Major
>
> After running a pipeline for a while in streaming mode (reading from Pub/Sub
> and writing to BigQuery, Datastore and another Pub/Sub), I noticed that the
> memory usage of the process grew drastically. Using guppy as a profiler, I got
> the following results.
> At start:
> {noformat}
> INFO *** MemoryReport Heap:
> Partition of a set of 240208 objects. Total size = 34988840 bytes.
> Index Count % Size % Cumulative % Kind (class / dict of class)
> 0 88289 37 8696984 25 8696984 25 str
> 1 53333 22 4897352 14 13594336 39 tuple
> 2 5083 2 2790664 8 16385000 47 dict (no owner)
> 3 1939 1 1749656 5 18134656 52 type
> 4 699 0 1723272 5 19857928 57 dict of module
> 5 12337 5 1579136 5 21437064 61 types.CodeType
> 6 12403 5 1488360 4 22925424 66 function
> 7 1939 1 1452616 4 24378040 70 dict of type
> 8 677 0 709496 2 25087536 72 dict of 0x1e4d880
> 9 25603 11 614472 2 25702008 73 int
> <1103 more rows. Type e.g. '_.more' to view.>
> {noformat}
> After several hours of running:
> {noformat}
> INFO *** MemoryReport Heap:
> Partition of a set of 1255662 objects. Total size = 315029632 bytes.
> Index Count % Size % Cumulative % Kind (class / dict of class)
> 0 95554 8 99755056 32 99755056 32 dict of apache_beam.runners.direct.bundle_factory._Bundle
> 1 117943 9 54193192 17 153948248 49 dict (no owner)
> 2 161068 13 27169296 9 181117544 57 unicode
> 3 94571 8 26479880 8 207597424 66 dict of apache_beam.pvalue.PBegin
> 4 126461 10 12715336 4 220312760 70 str
> 5 44374 4 12424720 4 232737480 74 dict of apitools.base.protorpclite.messages.FieldList
> 6 44374 4 6348624 2 239086104 76 apitools.base.protorpclite.messages.FieldList
> 7 95556 8 6115584 2 245201688 78 apache_beam.runners.direct.bundle_factory._Bundle
> 8 94571 8 6052544 2 251254232 80 apache_beam.pvalue.PBegin
> 9 57371 5 5218424 2 256472656 81 tuple
> <1187 more rows. Type e.g. '_.more' to view.>
> {noformat}
>
> It looks like every bundle, along with all of its data, stays in memory. Why
> aren't they garbage-collected?
> What is the garbage collection policy for Dataflow processes?
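
For anyone trying to reproduce the comparison above without guppy, a rough stdlib-only sketch is shown here. It is not guppy and not part of Beam: it only counts gc-tracked objects per type at two points in time and reports which types grew. The names `count_live_objects`, `report_growth` and `FakeBundle` are hypothetical and exist only for this illustration; `FakeBundle` is a stand-in for an object that keeps being referenced, not the real `_Bundle` class.

```python
import gc
from collections import Counter


def count_live_objects():
    """Return a Counter mapping type names to the number of gc-tracked instances."""
    gc.collect()  # drop unreachable cycles first so only live objects are counted
    return Counter(
        "{}.{}".format(getattr(type(obj), "__module__", "?"), type(obj).__name__)
        for obj in gc.get_objects()
    )


def report_growth(before, after, top=10):
    """Return up to `top` (type name, delta) pairs, largest growth first."""
    grown = [
        (name, after[name] - before.get(name, 0))
        for name in after
        if after[name] > before.get(name, 0)
    ]
    return sorted(grown, key=lambda item: item[1], reverse=True)[:top]


class FakeBundle(object):
    """Hypothetical stand-in for a per-bundle object that is never released."""

    def __init__(self, payload):
        self.payload = payload


if __name__ == "__main__":
    before = count_live_objects()
    # Simulate a leak: every "bundle" stays referenced from a long-lived list.
    retained = [FakeBundle({"n": i}) for i in range(1000)]
    after = count_live_objects()
    for name, delta in report_growth(before, after):
        print(name, delta)
```

In a real pipeline the two snapshots would be taken at startup and again after some hours. A type whose count keeps rising after `gc.collect()` is still strongly referenced from somewhere, and `gc.get_referrers()` on one such instance can help show what is holding it.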
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)