Hi folks,
Recently we have been seeing a really weird memory consumption
pattern with the Python protobuf implementation - the one backed by
the C++ extension under the hood. In some specific scenarios, we have
spotted a sudden increase in memory usage just by iterating over some
of the proto message attributes.
Some context. The message is composed of three levels of nested
repeated fields, like the following definition:

message Bar {
  message X {
    message Y {
      int32 value = 1;
    }
    repeated Y y = 1;
  }
  repeated X x = 1;
}

message Foo {
  repeated Bar bar = 1;
}
We have a serialized protobuf file using the previous message format
that takes around 1GB on disk and contains around 10M Bar messages as
repeated elements of a single Foo message. Our code looks similar to
the following:
from foo_pb2 import Foo

with open("/tmp/foo", "rb") as fd:
    foo = Foo()
    foo.ParseFromString(fd.read())

    # First pass: iterate over the top-level repeated field only.
    for bar in foo.bar:
        pass

    # Second pass: descend into the nested repeated fields as well.
    for bar in foo.bar:
        for x in bar.x:
            for y in x.y:
                pass
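For what it is worth, growth like this can be observed with only the
stdlib `resource` module; the snippet below uses a plain list
allocation as a stand-in for the protobuf iteration, so the numbers
are illustrative, not from our actual run (`rss_mb` is our own helper
name, not a protobuf API):

```python
import resource

def rss_mb():
    """Peak resident set size of this process, in MB.

    Note: Linux reports ru_maxrss in KB; macOS reports it in bytes,
    so the unit conversion below assumes Linux.
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

before = rss_mb()
data = list(range(5_000_000))  # allocate a few hundred MB of Python ints
after = rss_mb()
print(f"peak RSS grew by roughly {after - before:.0f} MB")
```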
We have noticed that after the first loop - which simply iterates
over all of the repeated Bar elements within the Foo object - the
memory grows until it reaches 16GB. And after the second loop, the
memory grows to almost 30GB.
Beyond the sheer amount of memory consumed, what really surprised us
was that the footprint grew just because of a simple iteration, so we
wondered whether we had found a memory leak. That seemed quite
unlikely, though, given the maturity of the project. Digging a bit
into the C extension implementation, we found something interesting.
Reading this piece of code [1], which belongs to the `getattribute`
path, it seems that Python objects are created lazily, i.e. they are
only created when they are actually accessed.
Is this true? Is there a lazy-loading pattern that creates the Python
objects if and only if they are accessed?
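To make the pattern we suspect concrete, here is a hypothetical
pure-Python sketch (not the actual protobuf implementation): if each
element access builds a Python wrapper and caches it on the parent,
then even a read-only pass over the tree pays the full allocation
cost, and the memory stays allocated afterwards.

```python
class LazyContainer:
    """Hypothetical sketch of lazy wrapper creation on element access."""

    def __init__(self, raw_items):
        self._raw = raw_items   # stands in for cheap, already-parsed C++ data
        self._wrappers = {}     # Python wrappers, created on demand

    def __getitem__(self, i):
        # The wrapper is only built when the element is first touched...
        if i not in self._wrappers:
            self._wrappers[i] = {"value": self._raw[i]}  # stand-in for a Message
        # ...but it is cached, so the allocation survives the iteration.
        return self._wrappers[i]

c = LazyContainer(range(1000))
assert len(c._wrappers) == 0     # nothing materialized yet
for item in c:                   # a plain read-only iteration...
    pass
assert len(c._wrappers) == 1000  # ...materializes (and keeps) every wrapper
```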
And if so, can this be circumvented in some way? If we do not need to
mutate an attribute, can we access the underlying object directly
without paying the cost of deserializing it?
I forgot to mention that we are using the 3.6.X version of protobuf.
I can see that the `message.cc` implementation has changed a bit in
master; is there anything in master or in the 3.7.X releases that
might help us reduce the memory footprint?
Thanks,
[1]
https://github.com/protocolbuffers/protobuf/blob/3.6.x/python/google/protobuf/pyext/message.cc#L2732
--
--pau
--
You received this message because you are subscribed to the Google Groups
"Protocol Buffers" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at https://groups.google.com/group/protobuf.
For more options, visit https://groups.google.com/d/optout.