On Tue, Jan 27, 2009 at 4:49 AM, Jake McGuire <j...@youtube.com> wrote: > Instance attribute names are normally interned - this is done in > PyObject_SetAttr (among other places). Unpickling (in pickle and cPickle) > directly updates __dict__ on the instance object. This bypasses the > interning so you end up with many copies of the strings representing your > attribute names, which wastes a lot of space, both in RAM and in pickles of > sequences of objects created from pickles. Note that the native python > memcached client uses pickle to serialize objects. > >>>> import pickle >>>> class C(object): > ... def __init__(self, x): > ... self.long_attribute_name = x > ... >>>> len(pickle.dumps([pickle.loads(pickle.dumps(C(None), >>>> pickle.HIGHEST_PROTOCOL)) for i in range(100)], pickle.HIGHEST_PROTOCOL)) > 3658 >>>> len(pickle.dumps([C(None) for i in range(100)], >>>> pickle.HIGHEST_PROTOCOL)) > 1441 >>>> > > Interning the strings on unpickling makes the pickles smaller, and at least > for cPickle actually makes unpickling sequences of many objects slightly > faster. I have included proposed patches to cPickle.c and pickle.py, and > would appreciate any feedback. > > dhcp-172-31-170-32:~ mcguire$ diff -u > Downloads/Python-2.4.3/Modules/cPickle.c cPickle.c > --- Downloads/Python-2.4.3/Modules/cPickle.c 2004-07-26 > 22:22:33.000000000 -0700 > +++ cPickle.c 2009-01-26 23:30:31.000000000 -0800 > @@ -4258,6 +4258,8 @@ > PyObject *state, *inst, *slotstate; > PyObject *__setstate__; > PyObject *d_key, *d_value; > + PyObject *name; > + char * key_str; > int i; > int res = -1; > > @@ -4319,8 +4321,24 @@ > > i = 0; > while (PyDict_Next(state, &i, &d_key, &d_value)) { > - if (PyObject_SetItem(dict, d_key, d_value) < 0) > - goto finally; > + /* normally the keys for instance attributes are > + interned. we should try to do that here. */ > + if (PyString_CheckExact(d_key)) { > + key_str = PyString_AsString(d_key); > + name = PyString_FromString(key_str); > + if (! name) > + goto finally; > + > + PyString_InternInPlace(&name); > + if (PyObject_SetItem(dict, name, d_value) < > 0) { > + Py_DECREF(name); > + goto finally; > + } > + Py_DECREF(name); > + } else { > + if (PyObject_SetItem(dict, d_key, d_value) < > 0) > + goto finally; > + } > } > Py_DECREF(dict); > } > > dhcp-172-31-170-32:~ mcguire$ diff -u Downloads/Python-2.4.3/Lib/pickle.py > pickle.py > --- Downloads/Python-2.4.3/Lib/pickle.py 2009-01-27 > 01:41:43.000000000 -0800 > +++ pickle.py 2009-01-27 01:41:31.000000000 -0800 > @@ -1241,7 +1241,15 @@ > state, slotstate = state > if state: > try: > - inst.__dict__.update(state) > + d = inst.__dict__ > + try: > + for k,v in state.items(): > + d[intern(k)] = v > + # keys in state don't have to be strings > + # don't blow up, but don't go out of our way > + except TypeError: > + d.update(state) > + > except RuntimeError: > # XXX In restricted execution, the instance's __dict__ > # is not accessible. Use the old way of unpickling >
Hi Jake, You should really post this to bugs.python.org as an enhancement so we can track the discussion there. -jesse _______________________________________________ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com