On Fri, Sep 23, 2016 at 7:05 PM, Christian <mining.fa...@gmail.com> wrote:
> I'm wondering why Python blows up a dictionary structure so much.
>
> The ids and cat substructures can have 0..n entries, but in most cases
> they are <= 10; t is limited to <= 6.
>
> Example:
>
> {'0a0f7a3a0e09826caef1bff707785662':
>  {'ids': {'aa316b86-8169-11e6-bab9-0050563e2d7c',
>           'aa3174f0-8169-11e6-bab9-0050563e2d7c',
>           'aa319408-8169-11e6-bab9-0050563e2d7c',
>           'aa3195e8-8169-11e6-bab9-0050563e2d7c',
>           'aa319732-8169-11e6-bab9-0050563e2d7c',
>           'aa319868-8169-11e6-bab9-0050563e2d7c',
>           'aa31999e-8169-11e6-bab9-0050563e2d7c',
>           'aa319b06-8169-11e6-bab9-0050563e2d7c'},
>   't': {'type1', 'type2'},
>   'dt': datetime.datetime(2016, 9, 11, 15, 15, 54, 343000),
>   'nids': 8,
>   'ntypes': 2,
>   'cat': [('ABC', 'aa316b86-8169-11e6-bab9-0050563e2d7c', '74', ''),
>           ('ABC', 'aa3174f0-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
>           ('ABC', 'aa319408-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
>           ('ABC', 'aa3195e8-8169-11e6-bab9-0050563e2d7c', '3', 'type2'),
>           ('ABC', 'aa319732-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
>           ('ABC', 'aa319868-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
>           ('ABC', 'aa31999e-8169-11e6-bab9-0050563e2d7c', '3', 'type1'),
>           ('ABC', 'aa319b06-8169-11e6-bab9-0050563e2d7c', '3', 'type2')]},
>  ...}
>
> >>> sys.getsizeof(superdict)
> 50331744
> >>> len(superdict)
> 941272
So... you have a million entries in the master dictionary, each of which has
an associated collection of data, consisting of half a dozen things, some of
which have subthings. The very smallest an object will ever be on a 64-bit
Linux system is 16 bytes:

>>> sys.getsizeof(object())
16

and most of these will be much larger:

>>> sys.getsizeof(8)
28
>>> sys.getsizeof(datetime.datetime(2016, 9, 11, 15, 15, 54, 343000))
48
>>> sys.getsizeof([])
64
>>> sys.getsizeof(('ABC', 'aa316b86-8169-11e6-bab9-0050563e2d7c', '74', ''))
80
>>> sys.getsizeof('aa316b86-8169-11e6-bab9-0050563e2d7c')
85
>>> sys.getsizeof({})
240

(Bear in mind that sys.getsizeof counts only the object itself, not the
things it references - that's why the tuple can take up less space than one
of its members.)

I don't think your collections can average less than about 1 KB (even the
textual representation of your example data is about that big), and you have
a million of them. That's a gigabyte of memory, right there. Your peak
memory usage is showing 3 GB, so most likely my conservative estimates have
established only an absolute lower bound. Try doing everything exactly the
same as you did, only without actually loading the pickle - then see what
the memory usage is. I think you'll find that the usage is fully legitimate.

> Thanks for any advice to save memory.

Use a database. I suggest PostgreSQL. That way you won't have to load
everything into memory at once, and (bonus!) you can update records on disk
without rewriting everything.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
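[Editor's note: since sys.getsizeof is shallow, a rough way to see the real
per-entry footprint is to sum sizes recursively. This is a hand-rolled
sketch, not from the thread - deep_sizeof and the cut-down sample entry are
illustrative only, and the helper ignores object-level __dict__ contents:]

```python
import sys
from datetime import datetime

def deep_sizeof(obj, seen=None):
    """Roughly estimate obj's total footprint by recursing into
    containers, counting each referenced object only once."""
    if seen is None:
        seen = set()
    if id(obj) in seen:          # already counted (shared reference)
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)    # shallow size of this object alone
    if isinstance(obj, dict):
        size += sum(deep_sizeof(k, seen) + deep_sizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_sizeof(item, seen) for item in obj)
    return size

# Cut-down version of one entry from the original post
entry = {
    'ids': {'aa316b86-8169-11e6-bab9-0050563e2d7c',
            'aa3174f0-8169-11e6-bab9-0050563e2d7c'},
    't': {'type1', 'type2'},
    'dt': datetime(2016, 9, 11, 15, 15, 54, 343000),
    'nids': 2,
    'ntypes': 2,
    'cat': [('ABC', 'aa316b86-8169-11e6-bab9-0050563e2d7c', '74', '')],
}

shallow = sys.getsizeof(entry)   # counts only the outer dict
deep = deep_sizeof(entry)        # counts the strings, sets, tuples too
print(shallow, deep)
```

Run against a full-sized entry with eight ids and eight cat tuples, the
deep figure lands near the ~1 KB estimate above, while the shallow figure
stays at the size of an empty-ish dict.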
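[Editor's note: the advice above is PostgreSQL; purely to make the idea
runnable without a server, here is a sketch using the stdlib sqlite3 module
instead. The table layout, column names, and the :memory: database are
illustrative assumptions, not part of the thread:]

```python
import sqlite3

# On disk this would be sqlite3.connect("superdict.db"); the key point is
# that entries live on disk and are fetched by key, not all held in RAM.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (
        key    TEXT PRIMARY KEY,
        dt     TEXT,
        nids   INTEGER,
        ntypes INTEGER
    );
    CREATE TABLE cat (
        key  TEXT REFERENCES entries(key),
        code TEXT, uuid TEXT, num TEXT, type TEXT
    );
""")

key = '0a0f7a3a0e09826caef1bff707785662'
conn.execute("INSERT INTO entries VALUES (?, ?, ?, ?)",
             (key, '2016-09-11 15:15:54.343000', 8, 2))
conn.execute("INSERT INTO cat VALUES (?, ?, ?, ?, ?)",
             (key, 'ABC', 'aa316b86-8169-11e6-bab9-0050563e2d7c', '74', ''))
conn.commit()

# Look up one entry by key - no need to load the other ~941,271 rows
row = conn.execute("SELECT nids, ntypes FROM entries WHERE key = ?",
                   (key,)).fetchone()
```

Updating a single row in place (the "bonus" mentioned above) is then a
plain UPDATE, rather than re-pickling the whole structure.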