On Fri, Mar 6, 2009 at 10:01 AM, Michael Haggerty <mhag...@alum.mit.edu> wrote:
> Antoine Pitrou wrote:
>> On Friday, 6 March 2009 at 13:44 +0100, Michael Haggerty wrote:
>>> Antoine Pitrou wrote:
>>>> Michael Haggerty <mhagger <at> alum.mit.edu> writes:
>>>>> It is easy to optimize the pickling of instances by giving them
>>>>> __getstate__() and __setstate__() methods.  But the pickler still
>>>>> records the type of each object (essentially, the name of its class) in
>>>>> each record.  The space for these strings constituted a large fraction
>>>>> of the database size.
>>>> If these strings are not interned, then perhaps they should be.
>>>> There is a similar optimization proposal (w/ patch) for attribute names:
>>>> http://bugs.python.org/issue5084
>>> If I understand correctly, this would not help:
>>>
>>> - on writing, the strings are identical anyway, because they are read
>>> out of the class's __name__ and __module__ fields.  Therefore the
>>> Pickler's usual memoizing behavior will prevent the strings from being
>>> written more than once.
>>
>> Then why did you say that "the space for these strings constituted a
>> large fraction of the database size", if they are already shared? Are
>> your objects so tiny that even the space taken by the pointer to the
>> type name grows the size of the database significantly?
>
> Sorry for the confusion.  I thought you were suggesting the change to
> help the more typical use case, when a single Pickler is used for a lot
> of data.  That use case will not be helped by interning the class
> __name__ and __module__ strings, for the reasons given in my previous email.
>
> In my case, the strings are shared via the Pickler memoizing mechanism
> because I pre-populate the memo (using the API that the OP proposes to
> remove), so your suggestion won't help my current code, either.  It was
> before I implemented the pre-populated memoizer that "the space for
> these strings constituted a large fraction of the database size".  But
> your suggestion wouldn't help that case, either.
>
> Here are the main use cases:
>
> 1. Saving and loading one large record.  A class's __name__ string is
> the same string object every time it is retrieved, so it only needs to
> be stored once and the Pickler memo mechanism works.  Similarly for the
> class's __module__ string.
>
> 2. Saving and loading lots of records sequentially.  Provided a single
> Pickler is used for all records and its memo is never cleared, this
> works just as well as case 1.
>
> 3. Saving and loading lots of records in random order, as for example in
> the shelve module.  It is not possible to reuse a Pickler with retained
> memo, because the Unpickler might not encounter objects in the right
> order.  There are two subcases:
>
>    a. Use a clean Pickler/Unpickler object for each record.  In this
> case the __name__ and __module__ of a class will appear once in each
> record in which the class appears.  (This is the case regardless of
> whether they are interned.)  On reading, the __name__ and __module__ are
> only used to look up the class, so interning them won't help.  It is
> thus impossible to avoid wasting a lot of space in the database.
>
>    b. Use a Pickler/Unpickler with a preset memo for each record (my
> unorthodox technique).  In this case the class __name__ and __module__
> will be memoized in the shared memo, so in other records only their ID
> needs to be stored (in fact, only the ID of the class object itself).
> This allows the database to be smaller, but does not have any effect on
> the RAM usage of the loaded objects.
>
> If the OP's proposal is accepted, 3b will become impossible.  The
> technique seems not to be well known, so maybe it doesn't need to be
> supported.  It would mean some extra work for me on the cvs2svn project
> though :-(
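
The size difference between case 2 and case 3a can be seen in a small sketch. It is an illustration only, not code from cvs2svn: it assumes a current Python 3 interpreter, pins pickle protocol 2 so that the class is written as a plain module/name pair, and uses `fractions.Fraction` as a stand-in record type. The variable names are made up for the example.

```python
import io
import pickle
from fractions import Fraction

# Stand-in records; any instances of one class would do.
records = [Fraction(n, 7) for n in range(1, 6)]

# Case 2: one Pickler for every record, memo never cleared.
shared_buf = io.BytesIO()
shared_pickler = pickle.Pickler(shared_buf, protocol=2)
shared_sizes = []
for rec in records:
    start = shared_buf.tell()
    shared_pickler.dump(rec)
    shared_sizes.append(shared_buf.tell() - start)

# Case 3a: a clean Pickler for each record (pickle.dumps creates one).
fresh_pickles = [pickle.dumps(rec, protocol=2) for rec in records]
fresh_sizes = [len(data) for data in fresh_pickles]

# The shared stream must be read back in order by a single Unpickler,
# because later records refer to memo entries made by earlier ones.
shared_buf.seek(0)
reader = pickle.Unpickler(shared_buf)
restored = [reader.load() for _ in records]

print("shared Pickler:", shared_sizes)
print("fresh Picklers:", fresh_sizes)
```

Only the first record written by the shared Pickler carries the class's module and name; later records refer back to the memo entry. Every independently pickled record repeats both strings, which is the wasted space described in 3a.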
Talking it over with Guido, support for the memo attribute will have to stay.
I shall add it back to my patches.

Collin
_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
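
For readers unfamiliar with the preset-memo technique (case 3b) that depends on the memo attribute, here is a minimal sketch. It is not the cvs2svn implementation. It assumes a current CPython 3, where assigning one Pickler's or Unpickler's `memo` to another copies the table, and it again uses `fractions.Fraction` and protocol 2 for illustration. `dump_record` and `load_record` are hypothetical helper names.

```python
import io
import pickle
from fractions import Fraction

PROTOCOL = 2

# Prime a Pickler by dumping the objects that every record will refer to.
# The primer pickle itself is stored once, outside the records.
primer_buf = io.BytesIO()
primed_pickler = pickle.Pickler(primer_buf, PROTOCOL)
primed_pickler.dump((Fraction,))

# Prime an Unpickler by loading the same primer, so that its memo maps
# the same memo IDs to the same objects.
primed_unpickler = pickle.Unpickler(io.BytesIO(primer_buf.getvalue()))
primed_unpickler.load()


def dump_record(obj):
    """Pickle one record with a fresh Pickler that starts from the preset memo."""
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf, PROTOCOL)
    pickler.memo = primed_pickler.memo  # assumption: CPython copies the table
    pickler.dump(obj)
    return buf.getvalue()


def load_record(data):
    """Unpickle one record with a fresh Unpickler that starts from the preset memo."""
    unpickler = pickle.Unpickler(io.BytesIO(data))
    unpickler.memo = primed_unpickler.memo
    return unpickler.load()


record = Fraction(1, 3)
data = dump_record(record)
print(len(data), "bytes primed vs", len(pickle.dumps(record, PROTOCOL)), "bytes unprimed")
```

Because every record starts from the same memo, records can be written and read in any order, and none of them needs to repeat the class's module and name.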