Hi,

Just as a follow-up: we have found quite weird memory-footprint behavior. It seems that when the objects are iterated, deallocation does not return all of the memory back to the process [1].
Any thoughts?

[1] https://github.com/protocolbuffers/protobuf/issues/5737

On Thu, Feb 14, 2019 at 10:47 PM Pau Freixes <[email protected]> wrote:
>
> Hi folks,
>
> Recently we have been experimenting with a really weird memory-consumption
> pattern using the Python protobuf implementation - the one backed by the
> C++ extension. In a specific scenario, we have spotted a sudden increase
> in memory usage just by iterating over some of the proto message
> attributes.
>
> Some context. The message is composed of three nested repeated fields,
> like the following:
>
> message Bar {
>   message X {
>     message Y {
>       int32 value = 1;
>     }
>     repeated Y y = 1;
>   }
>   repeated X x = 1;
> }
> message Foo {
>   repeated Bar bar = 1;
> }
>
> We have a serialized protobuf file in that format that takes around 1 GB
> on disk and contains around 10M Bar messages as repeated elements of one
> Foo message. We run code similar to the following:
>
> from foo_pb2 import Foo
>
> with open("/tmp/foo", "rb") as fd:
>     foo = Foo()
>     foo.ParseFromString(fd.read())
>
> for bar in foo.bar:
>     pass
>
> for bar in foo.bar:
>     for x in bar.x:
>         for y in x.y:
>             pass
>
> We have noticed that after the first loop - which simply iterates over
> all of the repeated Bar elements within the Foo object - the memory
> usage increases until it reaches 16 GB. After the second loop, it grows
> to almost 30 GB.
>
> Beyond the amount of memory consumed, what really surprised us was that
> the footprint grew just because of a simple iteration. We wondered
> whether we had found a memory leak, but that seemed quite unlikely given
> the maturity of the project. Digging a bit into the C extension
> implementation, we found something interesting: reading this piece of
> code [1], which implements the `getattribute` method, it seems that the
> Python objects are created lazily, only when they are accessed.
>
> Is this true? Is there a lazy-loading pattern that only creates the
> Python objects if and only if they are accessed?
>
> And in that case, can it be circumvented in some way? If we do not need
> to mutate an attribute, can we access the underlying object directly
> without paying the cost of deserializing it?
>
> I forgot to call out that we are using the 3.6.x version of protobuf. I
> can see that the "message.cc" implementation has changed a bit on
> master; is there anything on master or in the 3.7.x release that might
> help us reduce the memory footprint?
>
> Thanks,
>
> [1] https://github.com/protocolbuffers/protobuf/blob/3.6.x/python/google/protobuf/pyext/message.cc#L2732
>
> --
> --pau

--
--pau
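For anyone following along, the lazy-creation behavior described in the quoted message can be sketched in pure Python. This is only an analogy of what the C extension appears to do (the class and method names below are hypothetical, not protobuf's actual API): cheap native data stays untouched until an attribute is accessed, at which point a Python wrapper is built and cached, so a full iteration materializes - and pins - one Python object per element.

```python
class LazyMessage:
    """Analogy sketch of lazy wrapper creation (hypothetical names).

    The raw data (standing in for the C++-side message) is cheap; a
    Python wrapper is only built on first access and is then cached,
    so memory grows with every element you touch.
    """

    def __init__(self, raw_children):
        self._raw = raw_children   # cheap native-side data (plain dicts here)
        self._cache = {}           # Python wrappers, created on demand

    def child(self, i):
        # First access pays the construction cost and pins the wrapper
        # in the cache; repeated access reuses it.
        if i not in self._cache:
            self._cache[i] = dict(self._raw[i])
        return self._cache[i]


msg = LazyMessage([{"value": n} for n in range(3)])
assert len(msg._cache) == 0        # nothing materialized before iteration
for i in range(3):
    msg.child(i)                   # iterating materializes the wrappers
assert len(msg._cache) == 3        # ...and the cache keeps them alive
```

This would explain why a plain read-only loop grows the footprint: the iteration itself is what triggers (and retains) the per-element Python objects.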
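On the "can this be circumvented" question: one common workaround - assuming you can change the file format - is to store the 10M Bar messages as length-delimited records instead of one giant Foo, so each record is parsed, processed, and discarded before the next one is read. The sketch below uses plain 4-byte length prefixes and raw bytes in place of real protobuf framing and serialization, purely to illustrate the streaming shape:

```python
import io
import struct


def write_delimited(stream, payload: bytes):
    # Real protobuf tooling typically uses a varint length prefix;
    # a fixed 4-byte big-endian length keeps this sketch simple.
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_delimited(stream):
    # Yield one record at a time so only a single record's worth of
    # Python objects is alive at any moment.
    while True:
        header = stream.read(4)
        if not header:
            return
        (size,) = struct.unpack(">I", header)
        yield stream.read(size)


buf = io.BytesIO()
for payload in (b"bar0", b"bar1", b"bar2"):
    write_delimited(buf, payload)
buf.seek(0)
records = list(read_delimited(buf))
assert records == [b"bar0", b"bar1", b"bar2"]
```

In the real pipeline each yielded chunk would be fed to `Bar().ParseFromString(chunk)`, keeping the peak footprint close to one message rather than all 10M.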
